...1
Or, more precisely, one token per line; i.e., CWB expects punctuation marks, parentheses, quotes, etc. on separate lines. The precise tokenization rules depend on your theoretical assumptions and the requirements of annotation software such as part-of-speech taggers. CWB does not include any components for any kind of tagging, and has to be provided with a tokenized and annotated corpus.
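As a rough illustration of the one-token-per-line layout, the following sketch splits a sentence with a deliberately naive regular expression. The function name `naive_tokenize` and the splitting rule are assumptions made for this example only; they are not part of CWB, and a real corpus should be tokenized with a tool that matches your annotation software.

```python
import re

def naive_tokenize(text):
    """Split text into word tokens and single punctuation characters.

    Hypothetical helper for illustration: runs of word characters form one
    token, and every other non-space character becomes a token of its own.
    """
    return re.findall(r"\w+|[^\w\s]", text)

# Print one token per line, the layout expected in the input file.
for token in naive_tokenize("Hello, world (again)!"):
    print(token)
```

Each punctuation mark and parenthesis ends up on its own line, which is the property the footnote describes.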
.../corpora/data/example.2
The filesystem paths referred to in this manual are all Unix-style; however, CWB on Windows works happily with Windows-style paths.
... locations3
In previous versions of CWB, the default registry directory used to be /corpora/c1/registry (for historical reasons). All binary packages of CWB 3.0 and newer use the new default setting. If you already have a working environment with the old registry path, you may want to compile the CWB source code yourself, selecting the classic site configuration.
...cwb-encode.4
Older versions of CWB, including the long-term “stable” 3.0, only fully supported ISO 8859-1. While it is just about possible to work with other charsets in CWB 3.0, it is very strongly recommended that you upgrade to CWB version 3.4 to get full support for all ISO-8859-x encodings as well as UTF-8. As late as the mid-2010s, there were corpus annotation programs in wide use that generated ISO 8859 output, but as of this writing UTF-8 is the accepted standard and the recommended encoding for use with CWB. Nevertheless, cwb-encode still defaults to Latin-1 (for backward compatibility with 3.0) if no -c option is supplied; for this reason, we recommend always specifying the charset explicitly.
...5
By “available” we mean that the program in question must be both installed on your computer and findable by CWB. See Appendix C.
...6
Previous versions of CWB would default to the current working directory in the absence of a -d option. As a result, simply typing cwb-encode on the command line would litter this directory with a number of empty data files and then hang, waiting for corpus data on the standard input.
...7
Previous CWB versions had partial support for gzip-compressed input and output files, indicated in the respective man pages.
... option,8
Mnemonic: -L stands for sentence Limits.
... formats9
See, e.g., https://universaldependencies.org/format.html and the format examples given there.
...10
Along with the -C option for charset cleanup; see section 2.
... sets.11
A feature-set attribute that is not declared as such at index-time can still be treated as a feature set in CQP, but in this case responsibility is with the user to ensure that the values are well-formed feature sets.
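When the check is left to the user, it can be done before indexing. The sketch below tests whether a value looks like a well-formed feature set, assuming the usual CWB notation: elements delimited by `|`, with a leading and a trailing `|`, no empty or repeated elements, elements in sorted order, and a lone `|` for the empty set. The function name is hypothetical, and comparing the encoded bytes is an assumption about the sort order, not a statement of how CWB itself sorts.

```python
def is_well_formed_feature_set(value, encoding="utf-8"):
    """Return True if value looks like a sorted, duplicate-free feature set.

    Hypothetical helper: assumes the notation |a|b|c| and that elements
    are ordered by the byte values of their encoded form.
    """
    if value == "|":
        return True  # the empty set
    if len(value) < 2 or not (value.startswith("|") and value.endswith("|")):
        return False
    items = value[1:-1].split("|")
    if any(item == "" for item in items):
        return False  # empty elements such as "|a||b|" are rejected
    keys = [item.encode(encoding) for item in items]
    # Strictly increasing order rules out both unsorted and repeated elements.
    return all(a < b for a, b in zip(keys, keys[1:]))

for candidate in ["|", "|Nom|Sg|", "|Sg|Nom|", "Nom|Sg", "|Sg|Sg|"]:
    print(candidate, is_well_formed_feature_set(candidate))
```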
... string12
Keep in mind that the option -U "" has to be specified in this case in order to allow empty strings as values.
...awk13
awk is a standard Unix tool, not available on Windows by default, and not to our knowledge easy to install. On Windows, therefore, you would need to use some other program to process the corpus position data.
... YAC14
See Kermes and Evert (2002): https://www.aclweb.org/anthology/L02-1202/
...15
In order to re-create the original input file vss.vrt as a well-formed XML document, it would have been necessary to store the full strings of attribute-value pairs from XML start tags by using -V flags instead of -S in the cwb-encode attribute declarations (e.g. -V story:0+num+title). In the cwb-decode call, problematic s-attributes created by auto-splitting of these attribute-value pairs (story_num, story_title, s_len, np_head, ...) can then be omitted. The specification -S story would print the full attribute-value pairs in <story> tags, etc.
...sec:cwb-corpora-xml:16
There will be a few small differences due to escaping of XML metacharacters and the omitted collection attribute.
...17
Windows users should be aware that the data in this command, and in some of the subsequent examples, is piped through the Unix tools sort and head. On Windows, a more typical approach would be to redirect the output to a file and then use some GUI program (e.g. Notepad++, Microsoft Excel, etc.) to open the file and manipulate the data.
...18
More precisely, the quality score represents the weighted sum of features shared by the aligned regions. Long alignment beads will therefore usually achieve higher scores than shorter beads.
...19
Lexicon size refers to the sum of the byte lengths of all annotation strings in the lexicon, including NUL terminators.
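The definition can be made concrete with a small calculation. This is a minimal sketch, assuming the lexicon holds each distinct annotation string exactly once and that strings are stored in the corpus charset; the function name `lexicon_size` and the sample tokens are invented for illustration.

```python
def lexicon_size(values, encoding="utf-8"):
    """Sum the encoded byte lengths of the distinct strings, plus one NUL byte each."""
    distinct = set(values)
    return sum(len(s.encode(encoding)) + 1 for s in distinct)

tokens = ["the", "cat", "sat", "on", "the", "café"]
print(lexicon_size(tokens))             # UTF-8: "é" takes two bytes
print(lexicon_size(tokens, "latin-1"))  # Latin-1: "é" takes one byte
```

The repeated token "the" is counted once, and the two calls differ by one byte because of the encoding of "é".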
... automagic20
Automatic, as if by magic.