1.3 Tutorial corpora referenced in this manual
Pre-encoded versions of the DICKENS and GERMAN-LAW corpora can be downloaded from the CWB website,
http://cwb.sourceforge.net/download.php#corpora.
(Also available: a tool for encoding the British National Corpus 1994,
http://cwb.sourceforge.net/download.php#import.)
DICKENS
- a collection of novels by Charles Dickens
- ca. 3.4 million tokens
- derived from Etext editions (Project Gutenberg)
- document-structure markup added semi-automatically
- part-of-speech tagging and lemmatisation with TreeTagger
- recursive noun and prepositional phrases from the Gramotron parser
GERMAN-LAW
- a collection of freely available German law texts
- ca. 816,000 tokens
- part-of-speech tagging with TreeTagger
- morphosyntactic information and lemmatisation from the IMSLex morphology
- partial syntactic analysis with the YAC chunker
See Appendix A.3 for a detailed description of the token-level annotations (positional attributes)
and structural markup (structural attributes) of the two tutorial corpora.