2 First steps: Encoding and indexing

The standard CWB input format is one-word-per-line text,¹ with the surface form in the first column and token-level annotations specified as additional TAB-separated columns. XML tags for sentence boundaries and other structural annotation must appear on separate lines. This file format is also called verticalized text and has the customary file extension .vrt. An example of the verticalized text format for a short sentence with part-of-speech and lemma annotations is shown in Figure 1. This file, as well as all other input files required by the following examples, is made available in the accompanying data package.

**Figure 1:** Verticalized text file example.vrt
    <s>
    It	PP	it
    was	VBD	be
    an	DT	an
    elephant	NN	elephant
    .	SENT	.
    </s>
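As a rough illustration, a file like the one in Figure 1 can also be written and sanity-checked from the shell. The sketch below assumes a POSIX shell with awk available; the temporary directory and the variable names are hypothetical, and the column check is merely a convenience, not a step that CWB requires.

```shell
# Write the sentence from Figure 1 as verticalized text in a scratch directory.
workdir=$(mktemp -d)                 # hypothetical location; any directory works
vrt_file="$workdir/example.vrt"

printf '%s\n' '<s>' > "$vrt_file"
# printf reuses the format for every group of three arguments,
# so each token becomes one line with three TAB-separated columns.
printf '%s\t%s\t%s\n' \
    It PP it \
    was VBD be \
    an DT an \
    elephant NN elephant \
    . SENT . >> "$vrt_file"
printf '%s\n' '</s>' >> "$vrt_file"

# Count lines that are neither XML tags nor exactly three columns wide.
bad_lines=$(awk -F '\t' '!/^</ && NF != 3 { n++ } END { print n + 0 }' "$vrt_file")
echo "lines with unexpected column count: $bad_lines"
```

A non-zero count usually points to a missing annotation or to spaces used in place of TABs.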

In order to encode the file as a corpus, follow these steps:

Create a data directory where files in the binary CWB format will be stored. Here, we assume that this directory is called /corpora/data/example.² If this directory already exists and contains corpus data (from a previous version), you should delete all files in the directory. NB: You need a separate data directory for each corpus you want to encode.

Choose a registry directory, where all encoded corpora have to be registered to make them accessible to the CWB tools. It is recommended that you use the default registry directory. This varies depending on your operating system and method of installation. The most common locations³ of the default registry are:

On Unix, installed by compiling from source: /usr/local/share/cwb/registry
On Unix, installed via package manager: /usr/share/cwb/registry
On Windows, the location depends on your choice at install time, but is usually something like C:\Program Files\CorpusWorkbench\Registry

If you don't use the default registry, you will have to specify the path to your registry directory with a -r flag whenever you invoke one of the CWB tools (or set an appropriate environment variable, see below). In the example commands in this manual, we assume that you use the standard registry directory.

The next step is to encode the corpus, i.e. convert the verticalized text to CWB binary format with the cwb-encode tool. Note that the command below has to be entered on a single line.

$ cwb-encode -d /corpora/data/example 
             -xsBC9 -c ascii -f example.vrt 
             -R /usr/local/share/cwb/registry/example
             -P pos -P lemma -S s

(The $ character indicates a command line to be entered into your terminal. It is inspired by the customary input prompt used by the Bourne shells sh and bash.)

The first column of the input file is automatically encoded as the default positional attribute (p-attribute) named word. -P flags are used to declare additional p-attributes, i.e. token-level annotations. -S flags declare structural attributes (s-attributes), which encode non-recursive XML tags and whose names must correspond to the XML element names. By convention, all attribute names must be lowercase (more precisely, they may only contain the characters a-z, 0-9, -, and _, and may not start with a digit). Therefore, the names of XML elements to be included in the CWB corpus must not contain any non-ASCII or uppercase letters.
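The naming rule above can be checked before encoding. The following is a minimal sketch in POSIX shell; is_valid_attribute_name is a hypothetical helper (not a CWB command) that implements only the rule as stated here: characters a-z, 0-9, - and _, and no leading digit.

```shell
# Hypothetical helper: succeeds if $1 satisfies the attribute naming rule.
# Character sets are spelled out so the check does not depend on the locale.
is_valid_attribute_name() {
    case "$1" in
        ''|[0123456789]*)
            return 1 ;;   # empty, or starts with a digit
        *[!abcdefghijklmnopqrstuvwxyz0123456789_-]*)
            return 1 ;;   # contains a character outside a-z, 0-9, - and _
        *)
            return 0 ;;
    esac
}

# Try a few candidate names (the last three are deliberately problematic).
for name in pos lemma text_id Lemma 2gram "naïve"; do
    if is_valid_attribute_name "$name"; then
        echo "ok:      $name"
    else
        echo "invalid: $name"
    fi
done
```

XML element names that fail this check would need to be renamed in the input before they can be declared with -S.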

The -R option automatically creates a registry file, whose filename has to be written in lowercase. Note that it is necessary to specify the full path to the registry file, even if the default registry directory is used. The CWB name of the corpus (also called the corpus ID) is identical to the name of the registry file, but is written in uppercase (here it will be EXAMPLE). The CWB name is used to activate a corpus in the query processor CQP, for instance.
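The relationship between the registry file name and the corpus ID can be illustrated with standard shell tools. This is only a sketch of the naming convention described above; the variable names are hypothetical.

```shell
# Registry file as passed to -R in the example command.
registry_file=/usr/local/share/cwb/registry/example

# The corpus ID is the file name (without its directory) in uppercase.
corpus_id=$(basename "$registry_file" | tr '[:lower:]' '[:upper:]')
echo "$corpus_id"    # prints EXAMPLE
```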

-xsBC9 is a cluster of options which switch on data cleanup procedures. They are, in order: recognise and handle basic XML features (-x); ignore any empty lines (-s); tidy up stray blank space characters (-B); remove characters that are invalid for the specified encoding (-C); silently discard unrecognised XML tags (-9). Most of the time, you would want to use all of these; the only time to omit them is when you are working with files that you know have no encoding or formatting problems (or if you use the new -n or -N formats; see section 3). Using -x and -9 does not preclude more complex XML; see section 5.

The -c option specifies the character encoding (or charset) of the input data. The example.vrt file does not contain any non-ASCII characters, so in this example we specify -c ascii. The other commonly used charsets are Unicode UTF-8 (-c utf8) and ISO 8859-1 (-c latin1). We strongly recommend use of UTF-8 over ISO 8859 charsets wherever possible. A full list of charsets supported by CWB, and the corresponding single word labels used with the -c option, is available in the manual file for cwb-encode.⁴

Input files with the extensions .gz, .bz2 or .xz are assumed to be in the gzip, bzip2 and xz compressed formats, respectively. Such files are automatically decompressed (provided that gzip, bzip2 and/or xz are available).⁵

Multiple input files can be specified by using the -f option repeatedly. Files will be read in the order in which they appear on the command line. Shell wildcards (e.g. -f *.txt) do not work, since each file name must be preceded by -f. However, it is possible to read all files named *.vrt, *.vrt.gz, *.vrt.bz2 or *.vrt.xz in a given directory using the -F option (possibly repeated for multiple directories). The input files in each specified directory will be read in alphabetical order.
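Because every file name needs its own -f, one possible workaround for wildcards is to let the shell build the list of -f pairs in a loop. The sketch below assumes a POSIX shell; the scratch directory and the part*.txt file names are hypothetical, and the resulting command is only printed, not executed.

```shell
# Create a few placeholder input files in a scratch directory.
workdir=$(mktemp -d)
touch "$workdir/part1.txt" "$workdir/part2.txt" "$workdir/part3.txt"

# Collect one "-f FILE" pair per matching file in the positional parameters.
set --
for f in "$workdir"/*.txt; do
    [ -e "$f" ] || continue      # skip the literal pattern if nothing matches
    set -- "$@" -f "$f"
done

# Show the command that would be run; options precede attribute declarations.
echo cwb-encode -d /corpora/data/example -xsBC9 -c ascii "$@" \
    -R /usr/local/share/cwb/registry/example -P pos -P lemma -S s
```

The shell expands the wildcard in sorted order, so the files are passed to cwb-encode in that order. To run the command for real, remove the leading echo.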

All options (-d, -f, -R, etc.) must precede the attribute declarations (-P, -S, etc.) on the command line. It is mandatory to specify a data directory with the -d option.⁶ This directory should always be given as an absolute path, so the corpus can be used from any location in the file system.

Before a corpus can be used with CQP and other CWB programs, various index files have to be built. It is also strongly recommended to compress these index files, especially for larger corpora:

The easiest and recommended method for indexing and compression is to use the cwb-make script that comes with the CWB/Perl interface modules. If you are unable to install the modules and use this script (e.g. if you are using the Windows version of CWB), refer to Section 4 for a manual procedure.

$ cwb-make -V EXAMPLE

If you did not use the standard registry directory /usr/local/share/cwb/registry when running cwb-encode, you will have to specify the path to your registry directory with the -r option. Alternatively, you can set the environment variable CORPUS_REGISTRY, which is automatically recognized by all CWB programs. In a Bourne shell (sh or bash), this is achieved with the command

$ export CORPUS_REGISTRY=/home/stephanie/registry

In a C shell (csh or tcsh), the corresponding command is

$ setenv CORPUS_REGISTRY /home/stephanie/registry

In either case, it is probably a good idea to add this setting to your login profile (~/.profile or ~/.login). If you do not want to set the environment variable, you need to invoke cwb-make with

$ cwb-make -r /home/stephanie/registry -V EXAMPLE

On Windows, as noted above, cwb-make is not available, as it is part of CWB/Perl. However, the same methods of setting the registry apply to the utilities discussed in Section 4. Environment variables can be set persistently in Windows by going to the Settings app; clicking on “find a setting”; typing “environment”; and selecting “Edit environment variables”. In the interface that pops up, open the environment variables dialogue, and add a new variable with the name CORPUS_REGISTRY and the path to your registry as its value. To set the registry temporarily in a terminal session, use this command:

$ set CORPUS_REGISTRY=C:\Users\stephanie\registry

The following examples assume that you either use the default registry directory or have set the CORPUS_REGISTRY variable appropriately.

You can also specify multiple registry directories separated by colon characters (:), both in the CORPUS_REGISTRY environment variable and the -r options of command-line tools. This is convenient e.g. if some corpora are stored on external hard drives that are not always mounted. Such optional registry directories may be prefixed by a question mark (?) in order to indicate that they may not be accessible (otherwise CQP and some other tools will print warnings to alert you to possible typos in the registry path). For instance, one of the lead CWB developers has the following registry path in his ~/.bashrc configuration:

$ export CORPUS_REGISTRY=/Corpora/registry:?/Volumes/X/CWB/registry

The built-in default registry directory is not automatically appended to this path. If you want to specify additional registry directories but keep the default one, you need to include the default location explicitly in the value of CORPUS_REGISTRY.
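To make the structure of such a registry path concrete, the following sketch splits a colon-separated value into its components and marks the optional ones. It assumes a POSIX shell; list_registry_dirs is a hypothetical helper written for illustration and does not reproduce how the CWB tools parse the variable internally.

```shell
# Hypothetical helper: print each directory in a registry path, one per line,
# labelled as required or optional (leading "?"). The body runs in a subshell,
# so the changes to IFS and globbing do not leak into the calling shell.
list_registry_dirs() (
    IFS=:
    set -f                        # keep "?" from being expanded as a wildcard
    for dir in $1; do
        case "$dir" in
            \?*) echo "optional: ${dir#\?}" ;;
            *)   echo "required: $dir" ;;
        esac
    done
)

# Keep the default registry by listing it explicitly (adjust to your installation).
default_registry=/usr/local/share/cwb/registry
CORPUS_REGISTRY="/home/stephanie/registry:$default_registry:?/Volumes/X/CWB/registry"
export CORPUS_REGISTRY

list_registry_dirs "$CORPUS_REGISTRY"
```

Printing the components this way is a quick method for spotting a mistyped directory or a forgotten default location.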

The -V switch enables additional validation passes when an index is created and when data files are compressed. It should be omitted when encoding very large corpora (above 50 million tokens), in order to speed up processing. In this case, it is also advisable to limit memory usage with the -M option. The amount specified should be somewhat less than the amount of physical RAM available (depending on the number of users etc.; too little is better than too much). For instance, on a Linux machine with 8 GiB of RAM, -M 2048 is a safe choice. The cwb-make utility applies a default limit of -M 75 if no explicit -M option is given, which is unreasonably small for current hardware, being optimised for machines of the last millennium.
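As a rough aid for choosing a value, the sketch below derives a candidate -M setting from the amount of RAM. Note that the one-quarter ratio is purely an assumption, inferred from the single 8 GiB example above; it is not a rule documented by CWB, and suggest_memory_limit is a hypothetical helper.

```shell
# Hypothetical helper: print one quarter of the given RAM size (in MiB)
# as a candidate value for -M. The ratio is an assumption, not a CWB rule.
suggest_memory_limit() {
    echo $(( $1 / 4 ))
}

ram_mib=8192    # 8 GiB, as in the example above
echo "cwb-make -M $(suggest_memory_limit "$ram_mib") EXAMPLE"
```

On a shared machine, a smaller fraction may be more appropriate, in line with the advice that too little is better than too much.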

Use the cwb-describe-corpus utility to display some information about an encoded corpus (add the -s option for details and to reassure yourself that all necessary data files have been created):

$ cwb-describe-corpus EXAMPLE