The standard CWB input format is one-word-per-line text, with the surface form in the first column and token-level annotations specified as additional TAB-separated columns. XML tags for sentence boundaries and other structural annotation must appear on separate lines. This file format is also called verticalized text and has the customary file extension .vrt. An example of the verticalized text format for a short sentence with part-of-speech and lemma annotations is shown in Figure 1. This file, as well as all other input files required by the following examples, is made available in the accompanying data package.
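As a rough illustration of the layout described above, the following sketch writes a tiny verticalized file and checks that every token line has three TAB-separated columns. The sentence, the tags and the temporary file are hypothetical; this is not the example.vrt file from the data package.

```shell
# Hypothetical sample data, NOT the example.vrt from the data package.
vrt_file=$(mktemp)
{
  printf '<s>\n'                          # XML tag on a line of its own
  printf '%s\t%s\t%s\n' The DT the        # word, part of speech, lemma
  printf '%s\t%s\t%s\n' cat NN cat
  printf '%s\t%s\t%s\n' sleeps VBZ sleep
  printf '%s\t%s\t%s\n' . . .
  printf '</s>\n'
} > "$vrt_file"

# Count token lines that do not have exactly 3 columns.
# Simplification: every line starting with "<" is treated as an XML tag.
bad_lines=$(awk -F '\t' '!/^</ && NF != 3' "$vrt_file" | wc -l)
echo "token lines with a wrong column count: $((bad_lines))"
```

A check of this kind can catch missing annotation columns before the file is passed to cwb-encode.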
In order to encode the file as a corpus, follow these steps:
$ cwb-encode -d /corpora/data/example -xsBC9 -c ascii -f example.vrt -R /usr/local/share/cwb/registry/example -P pos -P lemma -S s
(The $ character indicates a command line to be entered into your terminal. It is inspired by the customary input prompt used by the Bourne shells sh and bash.)
The first column of the input file is automatically encoded as the default positional attribute (p-attribute) named word. -P flags are used to declare additional p-attributes, i.e. token-level annotations. -S flags declare structural attributes (s-attributes), which encode non-recursive XML tags and whose names must correspond to the XML element names. By convention, all attribute names must be lowercase (more precisely, they may only contain the characters a-z, 0-9, -, and _, and may not start with a digit). Therefore, the names of XML elements to be included in the CWB corpus must not contain any non-ASCII or uppercase letters.
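The naming rule above can be checked mechanically before encoding. The helper below is a minimal sketch; the function name is_valid_attr_name is made up for this illustration and is not part of CWB.

```shell
# Hypothetical helper, not a CWB tool: succeeds if the name contains only
# a-z, 0-9, - and _, and does not start with a digit.
is_valid_attr_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z_-][a-z0-9_-]*'
}

for name in pos lemma s text_id POS 2gram; do
  if is_valid_attr_name "$name"; then
    echo "$name: usable as an attribute name"
  else
    echo "$name: not usable as an attribute name"
  fi
done
```

With the names listed in the loop, the first four are accepted; POS is rejected because it is uppercase, and 2gram because it starts with a digit.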
The -R option automatically creates a registry file, whose filename has to be written in lowercase. Note that it is necessary to specify the full path to the registry file, even if the default registry directory is used. The CWB name of the corpus (also called the corpus ID) is identical to the name of the registry file, but is written in uppercase (here it will be EXAMPLE). The CWB name is used to activate a corpus in the query processor CQP, for instance.
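The relation between the registry file name and the corpus ID can be sketched in a few lines of shell. The variable names are chosen for this illustration only.

```shell
# Registry file path as used in the cwb-encode call above.
registry_file=/usr/local/share/cwb/registry/example

# The corpus ID is the registry file name written in uppercase.
corpus_id=$(basename "$registry_file" | LC_ALL=C tr 'a-z' 'A-Z')
echo "$corpus_id"
```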
-xsBC9 is a cluster of options which switch on data cleanup procedures. They are, in order: recognise and handle basic XML features (-x); ignore any empty lines (-s); tidy up stray blank space characters (-B); remove characters that are invalid for the specified encoding (-C); silently discard unrecognised XML tags (-9). Most of the time, you would want to use all of these; the only time to omit them is when you are working with files that you know have no encoding or formatting problems (or if you use the new -n or -N formats; see section 3). Using -x and -9 does not preclude more complex XML; see section 5.
The -c option specifies the character encoding (or charset) of the input data. The example.vrt file does not contain any non-ASCII characters, so in this example we specify -c ascii. The other commonly used charsets are Unicode UTF-8 (-c utf8) and ISO 8859-1 (-c latin1). We strongly recommend use of UTF-8 over ISO 8859 charsets wherever possible. A full list of charsets supported by CWB, and the corresponding single word labels used with the -c option, is available in the manual file for cwb-encode.
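Whether -c ascii is appropriate for a given input file can be checked with standard tools. The sketch below uses a hypothetical temporary file; it only tests for non-ASCII bytes and does not attempt to tell UTF-8 from ISO 8859-1.

```shell
# Hypothetical input file for the demonstration.
input_file=$(mktemp)
printf 'plain\tJJ\tplain\n' > "$input_file"

# Delete every byte in the 7-bit ASCII range and count what is left over.
non_ascii_bytes=$(LC_ALL=C tr -d '\000-\177' < "$input_file" | wc -c)

if [ "$non_ascii_bytes" -eq 0 ]; then
  ascii_only=yes
  echo "no non-ASCII bytes found: -c ascii should be fine"
else
  ascii_only=no
  echo "non-ASCII bytes found: choose the charset that matches the file"
fi
```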
Input files with the extensions .gz, .bz2 or .xz are assumed to be in the gzip, bzip2 and xz compressed formats, respectively. Such files are automatically decompressed (provided that gzip, bzip2 and/or xz are available).
Multiple input files can be specified by using the -f option repeatedly. Files will be read in the order in which they appear on the command line. Shell wildcards (e.g. -f *.txt) do not work, since each file name must be preceded by -f. However, it is possible to read all files named *.vrt, *.vrt.gz, *.vrt.bz2 or *.vrt.xz in a given directory using the -F option (possibly repeated for multiple directories). The input files in each specified directory will be read in alphabetical order.
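When the -F option does not fit (for instance, if only some files of a directory should be read), the repeated -f arguments can be assembled in a loop. The following is a sketch with a hypothetical temporary directory and file names; it prints the resulting command rather than running it.

```shell
# Hypothetical input directory with two verticalized files.
input_dir=$(mktemp -d)
touch "$input_dir/part1.vrt" "$input_dir/part2.vrt"

# Collect "-f <file>" pairs in the positional parameters.
set --
for f in "$input_dir"/*.vrt; do
  [ -e "$f" ] || continue        # skip the unexpanded pattern if nothing matches
  set -- "$@" -f "$f"
done
file_arg_count=$#

# Print the command for inspection; options come before attribute declarations.
echo cwb-encode -d /corpora/data/example -xsBC9 -c ascii "$@" \
  -R /usr/local/share/cwb/registry/example -P pos -P lemma -S s
```

The shell expands the pattern in sorted order, so the files are passed to cwb-encode in a predictable sequence.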
All options (-d, -f, -R, etc.) must precede the attribute declarations (-P, -S, etc.) on the command line. It is mandatory to specify a data directory with the -d option. This directory should always be given as an absolute path, so the corpus can be used from any location in the file system.
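If the data directory is only known as a relative path, it can be converted to an absolute one before it is passed to -d. The sketch below assumes the directory already exists; the helper name abs_path and the temporary directory are hypothetical.

```shell
# Hypothetical working directory containing a relative data directory.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/data/example"

# Hypothetical helper: print the absolute path of an existing directory.
abs_path() {
  (cd "$1" && pwd)               # subshell, so the caller's directory is unchanged
}

data_dir=$(cd "$work_dir" && abs_path data/example)
echo "use: -d $data_dir"
```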
Before a corpus can be used with CQP and other CWB programs, various index files have to be built. It is also strongly recommended to compress these index files, especially for larger corpora:
$ cwb-make -V EXAMPLE
If your registry is not in the default location, set its path in the environment variable CORPUS_REGISTRY, which is automatically recognized by all CWB programs. In a Bourne shell (sh or bash), this is achieved with the command
$ export CORPUS_REGISTRY=/home/stephanie/registry
In a C shell (csh or tcsh), the corresponding command is
$ setenv CORPUS_REGISTRY /home/stephanie/registry
In either case, it is probably a good idea to add this setting to your login profile (~/.profile or ~/.login). If you do not want to set the environment variable, you need to invoke cwb-make with
$ cwb-make -r /home/stephanie/registry -V EXAMPLE
On Windows, as noted above, cwb-make is not available, as it is part of CWB/Perl. However, the same methods of setting the registry apply to the utilities discussed in Section 4. Environment variables can be set persistently in Windows by opening the Settings app, clicking on “find a setting”, typing “environment”, and selecting “Edit environment variables”. In the interface that pops up, open the environment variables dialogue and add a new variable with the name CORPUS_REGISTRY and the path to your registry as its value.
To set the registry temporarily in a terminal session, use this command:
$ set CORPUS_REGISTRY=C:\Users\stephanie\registry
The following examples assume that you either use the default registry directory or have set the CORPUS_REGISTRY variable appropriately.
Multiple registry directories, separated by colons, can be specified both in the CORPUS_REGISTRY environment variable and in the -r options of command-line tools. This is convenient e.g. if some corpora are stored on external hard drives that are not always mounted. Such optional registry directories may be prefixed by a question mark (?) in order to indicate that they may not be accessible (otherwise CQP and some other tools will print warnings to alert you to possible typos in the registry path). For instance, one of the lead CWB developers has the following registry path in his ~/.bashrc configuration:
$ export CORPUS_REGISTRY=/Corpora/registry:?/Volumes/X/CWB/registry
The built-in default registry directory is not automatically appended to this path. If you want to specify additional registry directories but keep the default one, you need to include the default location explicitly in the value of CORPUS_REGISTRY.
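To see how such a registry path breaks down, the following sketch splits a colon-separated value into its entries and marks those prefixed with a question mark as optional. The function name list_registry_dirs is hypothetical; the code only mimics the notation described above and does not reproduce how CWB itself parses the variable.

```shell
# Hypothetical helper: print one line per entry of a colon-separated registry path.
list_registry_dirs() {
  (                              # subshell keeps the IFS and globbing changes local
    IFS=:
    set -f                       # "?" must not be treated as a shell wildcard
    for entry in $1; do
      case $entry in
        \?*) echo "optional ${entry#?}" ;;   # strip the leading question mark
        *)   echo "required $entry" ;;
      esac
    done
  )
}

list_registry_dirs '/Corpora/registry:?/Volumes/X/CWB/registry'
```

For the example value, the first directory is reported as required and the second as optional.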
The -V switch enables additional validation passes when an index is created and when data files are compressed. It should be omitted when encoding very large corpora (above 50 million tokens), in order to speed up processing. In this case, it is also advisable to limit memory usage with the -M option. The amount specified should be somewhat less than the amount of physical RAM available (depending on the number of users etc.; too little is better than too much). For instance, on a Linux machine with 8 GiB of RAM, -M 2048 is a safe choice. The cwb-make utility applies a default limit of -M 75 if no explicit -M option is given, which is unreasonably small for current hardware, being optimised for machines of the last millennium.
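A memory limit in the spirit of the example above can be derived from the amount of installed RAM. The one-quarter ratio in this sketch is only an assumption extrapolated from the single data point in the text (-M 2048 on a machine with 8 GiB); it is not a documented CWB rule, and the function name is hypothetical.

```shell
# Hypothetical helper. ASSUMPTION: one quarter of physical RAM, extrapolated
# from the "8 GiB -> -M 2048" example; adjust for the number of users etc.
suggest_mem_limit() {
  # $1 = physical RAM in MiB; prints a candidate value for -M
  echo $(( $1 / 4 ))
}

limit=$(suggest_mem_limit 8192)
echo "cwb-make -M $limit EXAMPLE"
```

Note that -V is left out of the printed command, in line with the advice for very large corpora.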
$ cwb-describe-corpus EXAMPLE