2 First steps: Encoding and indexing

The standard CWB input format is one-word-per-line text,1 with the surface form in the first column and token-level annotations in additional TAB-separated columns. XML tags for sentence boundaries and other structural annotation must appear on separate lines. This file format is also called verticalized text and customarily uses the file extension .vrt. Figure 1 shows an example of the verticalized text format for a short sentence with part-of-speech and lemma annotations. This file, as well as all other input files required by the following examples, is available in the accompanying data package.

Figure 1: Verticalized text file example.vrt
    <s>
    It	PP	it
    was	VBD	be
    an	DT	an
    elephant	NN	elephant
    .	SENT	.
    </s>
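Since TAB separators are easily lost when an example is copied by hand, the following sketch writes the content of Figure 1 to a file with explicit TABs and then checks the column count. The scratch directory and the variable names are assumptions made for illustration, and the check simply treats every line starting with < as an XML tag.

```shell
# Write the Figure 1 example to a scratch directory (hypothetical location).
workdir=$(mktemp -d)
vrt="$workdir/example.vrt"

# printf turns \t into a real TAB and reuses the format for each group of
# three arguments, so this writes one token line per word/pos/lemma triple.
printf '<s>\n' > "$vrt"
printf '%s\t%s\t%s\n' \
    It PP it \
    was VBD be \
    an DT an \
    elephant NN elephant \
    . SENT . >> "$vrt"
printf '</s>\n' >> "$vrt"

# Count token lines (lines not starting with '<') that do not have
# exactly three TAB-separated columns.
bad_lines=$(awk -F '\t' '!/^</ && NF != 3 { n++ } END { print n + 0 }' "$vrt")
echo "token lines with an unexpected column count: $bad_lines"
```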

In order to encode the file as a corpus, follow these steps:
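The command listing for these steps is not reproduced here. The sketch below assembles an invocation that is consistent with the options discussed in the remainder of this section (-d, -f, -R, -xsBC9, -c, -P, -S). The data directory and registry paths, the attribute names pos and lemma, and the RUN_ENCODE switch are assumptions for illustration and should be adapted to the local installation.

```shell
# Assumed locations (not given in the text); adjust to your installation.
DATA_DIR=${DATA_DIR:-/corpora/data/example}
REGISTRY_FILE=${REGISTRY_FILE:-/usr/local/share/cwb/registry/example}

# General options come first, attribute declarations (-P, -S) last.
set -- cwb-encode -d "$DATA_DIR" -f example.vrt -R "$REGISTRY_FILE" \
    -xsBC9 -c ascii -P pos -P lemma -S s
encode_cmd=$*
echo "$encode_cmd"

# Only run the command when explicitly requested (RUN_ENCODE is a
# hypothetical switch of this sketch) and when cwb-encode is installed.
if [ "${RUN_ENCODE:-no}" = yes ] && command -v cwb-encode >/dev/null 2>&1; then
    mkdir -p "$DATA_DIR"    # create the data directory before encoding
    "$@"
fi
```

By default the script only prints the command line, so it can be inspected before anything is written; setting RUN_ENCODE=yes executes it.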

The first column of the input file is automatically encoded as the default positional attribute (p-attribute) named word. -P flags are used to declare additional p-attributes, i.e. token-level annotations. -S flags declare structural attributes (s-attributes), which encode non-recursive XML tags and whose names must correspond to the XML element names. By convention, all attribute names must be lowercase (more precisely, they may only contain the characters a-z, 0-9, -, and _, and may not start with a digit). Therefore, the names of XML elements to be included in the CWB corpus must not contain any non-ASCII or uppercase letters.
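The naming rule can be checked before encoding. The function below is a minimal sketch that implements the rule exactly as stated above (only a-z, 0-9, - and _, no leading digit); the function name is hypothetical, and cwb-encode itself may apply further restrictions.

```shell
# Use byte-wise matching so that a-z means exactly the ASCII lowercase letters.
LC_ALL=C
export LC_ALL

# Succeeds (exit status 0) if $1 satisfies the naming rule stated in the text.
is_valid_attribute_name() {
    case $1 in
        '' | [0-9]*)    return 1 ;;   # empty, or starts with a digit
        *[!a-z0-9_-]*)  return 1 ;;   # contains a character outside a-z, 0-9, -, _
        *)              return 0 ;;
    esac
}

for name in pos lemma s text_id Lemma 2nd text.id; do
    if is_valid_attribute_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: invalid"
    fi
done
```

With this rule, pos, lemma, s and text_id are accepted, while Lemma (uppercase letter), 2nd (leading digit) and text.id (dot) are rejected.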

The -R option automatically creates a registry file, whose filename has to be written in lowercase. Note that it is necessary to specify the full path to the registry file, even if the default registry directory is used. The CWB name of the corpus (also called the corpus ID) is identical to the name of the registry file, but is written in uppercase (here it will be EXAMPLE). The CWB name is used to activate a corpus in the query processor CQP, for instance.
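As a small illustration of this naming relationship, the corpus ID can be derived from the registry file name in the shell. The registry path is an assumed example, not a location prescribed by the text.

```shell
# Assumed registry file path; only the file name matters here.
REGISTRY_FILE=/usr/local/share/cwb/registry/example

# The corpus ID is the registry file name written in uppercase.
corpus_id=$(basename "$REGISTRY_FILE" | tr '[:lower:]' '[:upper:]')
echo "$corpus_id"
```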

-xsBC9 is a cluster of options which switch on data cleanup procedures. They are, in order: recognise and handle basic XML features (-x); ignore any empty lines (-s); tidy up stray blank space characters (-B); remove characters that are invalid for the specified encoding (-C); silently discard unrecognised XML tags (-9). Most of the time, you would want to use all of these; the only time to omit them is when you are working with files that you know have no encoding or formatting problems (or if you use the new -n or -N formats; see section 3). Using -x and -9 does not preclude more complex XML; see section 5.

The -c option specifies the character encoding (or charset) of the input data. The example.vrt file does not contain any non-ASCII characters, so in this example we specify -c ascii. The other commonly used charsets are Unicode UTF-8 (-c utf8) and ISO 8859-1 (-c latin1). We strongly recommend using UTF-8 rather than ISO 8859 charsets wherever possible. A full list of the charsets supported by CWB, and the corresponding single-word labels used with the -c option, is available in the manual file for cwb-encode.4
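Whether -c ascii is appropriate can be checked by counting the bytes outside the ASCII range in an input file. This is a rough sketch: the function name and the two sample files are made up for illustration, and a non-zero count only shows that ASCII is not sufficient; it does not tell UTF-8 and ISO 8859-1 apart.

```shell
workdir=$(mktemp -d)
# Two hypothetical one-line input files: pure ASCII, and "café" encoded as UTF-8.
printf 'elephant\tNN\telephant\n' > "$workdir/ascii.vrt"
printf 'caf\303\251\tNN\tcaf\303\251\n' > "$workdir/utf8.vrt"

# Print the number of bytes in file $1 that lie outside the ASCII range.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

for f in "$workdir/ascii.vrt" "$workdir/utf8.vrt"; do
    if [ "$(count_non_ascii_bytes "$f")" -eq 0 ]; then
        echo "$f: pure ASCII, -c ascii is sufficient"
    else
        echo "$f: non-ASCII bytes found, pick -c to match the actual encoding"
    fi
done
```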

Input files with the extensions .gz, .bz2 or .xz are assumed to be in the gzip, bzip2 and xz compressed formats, respectively. Such files are automatically decompressed (provided that gzip, bzip2 and/or xz are available).5

Multiple input files can be specified by using the -f option repeatedly. Files will be read in the order in which they appear on the command line. Shell wildcards (e.g. -f *.txt) do not work, since each file name must be preceded by -f. However, it is possible to read all files named *.vrt, *.vrt.gz, *.vrt.bz2 or *.vrt.xz in a given directory using the -F option (possibly repeated for multiple directories). The input files in each specified directory will be read in alphabetical order.
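Because every file name needs its own -f, a wildcard has to be expanded into repeated -f arguments by the shell. The loop below is one way to do this; the scratch directory and the file names a.vrt, b.vrt and c.vrt are placeholders, and the resulting argument list is only printed, not passed to cwb-encode.

```shell
workdir=$(mktemp -d)
touch "$workdir/a.vrt" "$workdir/b.vrt" "$workdir/c.vrt"

# Collect "-f FILE" pairs in the positional parameters.
set --
for f in "$workdir"/*.vrt; do
    [ -e "$f" ] || continue      # skip the literal pattern if nothing matched
    set -- "$@" -f "$f"
done

file_args=$#
first_file=$2
echo "cwb-encode $* ..."
```

When all input files sit in one directory, a single -F with that directory achieves the same effect, as described above.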

All options (-d, -f, -R, etc.) must precede the attribute declarations (-P, -S, etc.) on the command line. It is mandatory to specify a data directory with the -d option.6 This directory should always be given as an absolute path, so the corpus can be used from any location in the file system.

Before a corpus can be used with CQP and other CWB programs, various index files have to be built. It is also strongly recommended to compress these index files, especially for larger corpora:
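The command for this step is not reproduced here. A plausible minimal form, based on the cwb-make utility and the -V switch discussed in this section, is sketched below; the corpus ID EXAMPLE follows from the registry file name used earlier, and the RUN_MAKE switch is a hypothetical guard of this sketch.

```shell
CORPUS_ID=EXAMPLE

# Build indexes and compress them in one go, with validation enabled.
make_cmd="cwb-make -V $CORPUS_ID"
echo "$make_cmd"

# Only run the command when explicitly requested and when cwb-make is installed.
if [ "${RUN_MAKE:-no}" = yes ] && command -v cwb-make >/dev/null 2>&1; then
    $make_cmd
fi
```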

The -V switch enables additional validation passes when an index is created and when data files are compressed. It should be omitted when encoding very large corpora (above 50 million tokens), in order to speed up processing. In this case, it is also advisable to limit memory usage with the -M option. The amount specified should be somewhat less than the amount of physical RAM available (depending on the number of users etc.; too little is better than too much). For instance, on a Linux machine with 8 GiB of RAM, -M 2048 is a safe choice. The cwb-make utility applies a default limit of -M 75 if no explicit -M option is given, which is unreasonably small for current hardware, being optimised for machines of the last millennium.
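The choice between the two modes can be written down as a small helper. This is only a sketch of the rule of thumb given above: the function name, the corpus ID BIGCORPUS and the token counts are invented for illustration, and the -M value is taken to be in megabytes, which is an assumption based on the 2048-for-8-GiB example.

```shell
# Print a cwb-make command line for a corpus of the given size.
#   $1 = corpus ID, $2 = number of tokens, $3 = memory limit passed to -M
make_command() {
    if [ "$2" -gt 50000000 ]; then
        # Very large corpus: skip validation (-V) and limit memory usage.
        echo "cwb-make -M $3 $1"
    else
        echo "cwb-make -V $1"
    fi
}

make_command EXAMPLE 6 2048
make_command BIGCORPUS 120000000 2048
```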