9.1 The example corpora

First, let's introduce the tutorial data we'll be working with. All the files mentioned here are available as part of the data package provided alongside the CWB Encoding Manual. The corpus we'll use to practice alignment consists of a very short excerpt from the novel The Hound of the Baskervilles by Arthur Conan Doyle, which we'll call the Holmes corpus after the main character. As well as the original English, we have a German translation of the same text. We'll use the CWB labels HOLMES-EN for the source corpus and HOLMES-DE for the target corpus (i.e. the translation). Distinguishing the components of a parallel corpus by a language-code suffix like this is a convenient convention for labelling aligned corpora in CWB.

Figure 6: Example from the source corpus (file holmes_en.vrt), with abbreviations
<p num="3">
<s id="a">
Mr. NP ...
...
</s>
[... two more sentences ...]
</p>

Figure 7: Example from the target corpus (file holmes_de.vrt), with abbreviations
<p num="3">
<s id="a">
Mr. NN ...
...
</s>
[... three more sentences ...]
</p>

Before going any further, you should index these two corpora, using the following commands:

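$ cwb-encode -d /corpora/data/holmes-en -c utf8 -f holmes_en.vrt \
             -R /usr/local/share/cwb/registry/holmes-en \
             -P pos -P lemma -S s+id -S p+num
$ cwb-encode -d /corpora/data/holmes-de -c utf8 -f holmes_de.vrt \
             -R /usr/local/share/cwb/registry/holmes-de \
             -P pos -P lemma -S s+id -S p+num

You should, of course, amend the -d and -R options to suit your own setup. Note that each corpus needs its own data directory: both corpora declare the same attributes, so encoding them into a single directory would cause the second corpus to overwrite the files of the first.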
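Note that cwb-encode on its own writes only the encoded data and the registry entry. If the index files have not been built yet, they can be created with cwb-makeall. The following is a minimal sketch rather than a required step of this tutorial; it assumes the registry directory used in the -R options above and skips the step if cwb-makeall is not on your PATH.

```shell
# Assumed registry directory (taken from the -R options above); adjust as needed.
REGISTRY=/usr/local/share/cwb/registry

for corpus in HOLMES-EN HOLMES-DE; do
    if command -v cwb-makeall >/dev/null 2>&1; then
        # -r selects the registry directory; -V validates the index files once built.
        cwb-makeall -r "$REGISTRY" -V "$corpus" \
            || echo "indexing $corpus failed" >&2
    else
        echo "cwb-makeall not found; skipping $corpus" >&2
    fi
done
```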

All the example commands given in the following sections are based on these two corpora. They do not include the -r option to specify the registry directory location. If you have placed the registry files for the two corpora anywhere other than the default registry, you will need either to add the -r option, or else to use the CWB_REGISTRY environment variable.
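The two ways of pointing CWB at a non-default registry can be sketched as follows. The directory $HOME/cwb/registry is a hypothetical location chosen purely for illustration, and cwb-describe-corpus is used here only as a convenient command for checking that a corpus can be found.

```shell
# Hypothetical non-default registry location; replace with your own.
MY_REGISTRY="$HOME/cwb/registry"

# Set the environment variable once for the current shell session ...
export CWB_REGISTRY="$MY_REGISTRY"

if command -v cwb-describe-corpus >/dev/null 2>&1; then
    # ... so that commands without -r look in $CWB_REGISTRY,
    cwb-describe-corpus HOLMES-DE \
        || echo "HOLMES-DE not found via CWB_REGISTRY" >&2
    # or pass the registry explicitly with -r on each command.
    cwb-describe-corpus -r "$MY_REGISTRY" HOLMES-EN \
        || echo "HOLMES-EN not found in $MY_REGISTRY" >&2
fi
```

Setting CWB_REGISTRY is usually the more convenient choice when working through a series of commands, since the example commands in the following sections can then be typed exactly as printed.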