1.3 Tutorial corpora referenced in this manual

Pre-encoded versions of the DICKENS and GERMAN-LAW corpora are distributed from the CWB website, http://cwb.sourceforge.net/download.php#corpora.

(Also available: a tool for encoding the British National Corpus 1994, http://cwb.sourceforge.net/download.php#import.)

1.3.0.1 English corpus: DICKENS

- a collection of novels by Charles Dickens
- ca. 3.4 million tokens
- derived from Etext editions (Project Gutenberg)
- document-structure markup added semi-automatically
- part-of-speech tagging and lemmatisation with TreeTagger
- recursive noun and prepositional phrases from Gramotron parser

1.3.0.2 German corpus: GERMAN-LAW

- a collection of freely available German law texts
- ca. 816,000 tokens
- part-of-speech tagging with TreeTagger
- morphosyntactic information and lemmatisation from IMSLex morphology
- partial syntactic analysis with YAC chunker

See Appendix A.3 for a detailed description of the token-level annotations and structural markup of the two tutorial corpora (positional and structural attributes).