- 1
- Or, more precisely, one token per line; i.e., CWB expects
punctuation marks, parentheses, quotes, etc. on separate lines. The
precise tokenization rules depend on your theoretical assumptions
and the requirements of annotation software such as part-of-speech taggers.
CWB does not include any components for any kind of tagging, and has to be
provided with a tokenized and annotated corpus.
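As a rough illustration of the one-token-per-line layout, the sketch below splits a sentence so that every punctuation mark, bracket, and quote ends up on its own line. The regular expression and the function name `to_vertical` are illustrative assumptions, not part of CWB; a real corpus should be tokenized with rules that match your tagger.

```python
import re

# Toy rule (an assumption for illustration): a token is either a run of
# word characters or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_vertical(text):
    """Return the text with one token per line."""
    return "\n".join(TOKEN_RE.findall(text))


if __name__ == "__main__":
    print(to_vertical('He said: "Stop (now)!"'))
```

This toy rule also splits contractions and hyphenated words into several tokens, which may or may not be what your annotation tools expect.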
- 2
- The filesystem paths referred to in this manual are all Unix-style;
however, CWB on Windows works happily with Windows-style paths.
- 3
- In previous versions of CWB, the default registry directory
used to be /corpora/c1/registry (for historical reasons). All
binary packages of CWB 3.0 and newer use the new default setting. If you
already have a working environment with the old registry path, you may
want to compile the CWB source code yourself, selecting the
classic site configuration.
- 4
- Older versions of CWB - including the long-term “stable” 3.0 - only fully supported
ISO 8859-1. While it is just about possible to work with other charsets in CWB 3.0,
we very strongly recommend that you upgrade to CWB version 3.4
to get full support for all ISO 8859 encodings as well as UTF-8.
As late as the mid-2010s, corpus annotation programs that generated ISO 8859 output
were still in wide use, but as of this writing UTF-8 is the accepted
standard, and the recommended encoding for use with CWB. Nevertheless,
cwb-encode still defaults to Latin-1 (for backward compatibility with 3.0)
if no -c option is supplied; for this reason we recommend always
specifying the charset explicitly.
- 5
- By “available” we mean that the program in question must be
both installed on your computer and findable by CWB. See Appendix C.
- 6
- Previous versions of CWB would default to the current working
directory in the absence of a -d option.
As a result, simply typing cwb-encode on the command
line would litter this directory with a number of empty data files and then
hang, waiting for corpus data on the standard input.
- 7
- Previous CWB versions had partial support for gzip-compressed input
and output files, indicated in the respective man pages.
- 8
- Mnemonic: -L stands for
sentence Limits.
- 9
- See e.g. https://universaldependencies.org/format.html and
the format examples on this page.
- 10
- Along with the -C option for charset cleanup; see section 2.
- 11
- A feature-set attribute that is not declared as such at index-time
can still be treated as a feature set in CQP, but in this case
responsibility is with the user to ensure that the values are well-formed
feature sets.
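A user who takes on that responsibility might pre-check values before encoding. The sketch below assumes the usual CWB feature-set notation, in which elements are delimited by `|` with a leading and trailing bar (e.g. `|a|b|`), the empty set is written as a single `|`, and elements are sorted without duplicates. The byte-wise sort order and the function name are assumptions made for illustration, so verify them against your CWB version.

```python
def is_wellformed_feature_set(value):
    """Check a string against the assumed feature-set notation |a|b|c|."""
    if value == "|":
        return True  # the empty set
    if len(value) < 3 or not (value.startswith("|") and value.endswith("|")):
        return False
    items = value[1:-1].split("|")
    if any(item == "" for item in items):
        return False  # an empty element, e.g. "|a||b|"
    # Assumption: elements must be in strictly ascending byte order,
    # which also rules out duplicates.
    keys = [item.encode("utf-8") for item in items]
    return all(a < b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    for candidate in ["|", "|a|b|", "|b|a|", "a|b"]:
        print(repr(candidate), is_wellformed_feature_set(candidate))
```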
- 12
- Keep in mind that the option
-U "" has to be specified in this case in order to allow empty
strings as values.
- 13
- awk is a standard Unix tool that is not available on Windows by default
and, to our knowledge, is not easy to install. On Windows, therefore, you would
need to use some other program to process the corpus position data.
- 14
- See Kermes and Evert (2002): https://www.aclweb.org/anthology/L02-1202/
- 15
- In order to re-create the original input file vss.vrt as a
well-formed XML document, it would have been necessary to store the full
strings of attribute-value pairs from XML start tags by using -V
flags instead of -S in the cwb-encode attribute declarations
(e.g. -V story:0+num+title). In the cwb-decode call,
problematic s-attributes created by auto-splitting of these attribute-value
pairs (story_num, story_title, s_len, np_head,
...) can then be omitted. The specification -S story would
print the full attribute-value pairs in <story> tags, etc.
- 16
- There will be a few small differences due to escaping of XML
metacharacters and the omitted collection attribute.
- 17
- Windows users be aware: the output of this command, and of some of the
subsequent examples, is piped through the Unix tools
sort and head. On Windows a more typical approach would
be to redirect the output to a file and then use some GUI program
(e.g. Notepad++, Microsoft Excel, etc.) to open the file and
manipulate the data.
- 18
- More precisely, the quality score represents the weighted sum of features
shared by the aligned regions. Therefore long alignment beads will usually
achieve higher scores than shorter beads.
- 19
- Lexicon size refers to the sum of the byte lengths of all annotation
strings in the lexicon, including NUL terminators.
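The definition above can be made concrete with a short calculation. This is only a sketch of the arithmetic, not CWB code: the function name `lexicon_size` is hypothetical, and the lexicon is modelled simply as the set of distinct annotation strings.

```python
def lexicon_size(values, encoding="utf-8"):
    """Sum of encoded byte lengths of the distinct strings, plus one
    NUL terminator byte per string."""
    return sum(len(v.encode(encoding)) + 1 for v in set(values))


if __name__ == "__main__":
    # "the" occurs twice in the corpus but only once in the lexicon.
    print(lexicon_size(["the", "cat", "the"]))
```

Note that the result depends on the corpus charset: a string with non-ASCII characters occupies more bytes in UTF-8 than in a single-byte ISO 8859 encoding.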
- 20
- Automatic, as if by magic.