9.3 Advanced use of the aligner

The output of cwb-align can often be improved by drawing on different parts of the original data, or by adjusting the weights it gives to different kinds of comparison.

One tweak we can make is to the p-attribute used by the aligner to measure similarity. The following command, for instance, will use the lemma attribute as the “text” of the corpora when comparing their content:

$ cwb-align -P lemma -o holmes.align HOLMES-EN HOLMES-DE s

The chosen p-attribute must have the same name in both the source and the target corpus for this to work.

There are various reasons why you might use an attribute other than the default word for the lexical comparisons. You might choose to use the lemma attribute, for instance, if the two languages are closely related but differ in their inflections (in which case the lemmata would be overall more similar to one another, and thus easier to align, than the actual word-tokens). Alternatively, you might choose to align using the lemma attribute if you had a bilingual lexicon available which contained lemmata. cwb-align is able to use such a lexicon if it is available: words which are identified in the lexicon as equivalent will then count as “similar” for alignment purposes even if they are formally nothing alike.

Figure 9: A very short English-German bilingual lexicon file, lex.txt

be sein
sit sitzen
stand stehen

The format of a lexicon file is shown in figure 9. The aligner can be instructed to use it as follows:

$ cwb-align -P lemma -o holmes.align HOLMES-EN HOLMES-DE s -W:50:lex.txt

The -W option is an aligner configuration flag, so it goes after the names of the corpora and the grid attribute - in contrast to the general options already discussed, which precede the names of the corpora.
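Before handing a lexicon file to the aligner, it may be worth checking that it has the shape shown in figure 9. The following is a minimal sketch, not part of CWB itself: it recreates lex.txt and counts the lines that do not consist of exactly two fields. That fields are separated by whitespace, with one pair per line, is an assumption read off the figure.

```shell
# Recreate the three-entry lexicon from figure 9.
cat > lex.txt <<'EOF'
be sein
sit sitzen
stand stehen
EOF

# Count lines that do not have exactly two whitespace-separated fields.
# (Assumption: one source/target pair per line, as in figure 9.)
bad_lines=$(awk 'NF != 2 { n++ } END { print n + 0 }' lex.txt)

if [ "$bad_lines" -eq 0 ]; then
    echo "lex.txt: every line is a word pair"
else
    echo "lex.txt: $bad_lines malformed line(s)" >&2
fi
```

A non-zero count points to entries that probably need fixing by hand, for instance a lemma containing a space.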

When the -W flag is used, you must specify two things: first the weight to be given to words that match when aligning sentences, and then the name of the file containing the pairs of equivalents. The weight given in the example above, 50, is equal to the default weight given to an occurrence of the exact same word in both languages. This number is one of the parameters that you can change to try to improve the alignment output; see also below.
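One way to experiment with the lexicon weight is to run the aligner several times and compare the resulting files. The loop below is only a sketch: the weights 25 and 100 are arbitrary illustrative values, and the output file names (holmes.w25.align and so on) are hypothetical. It prints the commands rather than running them, so that they can be inspected first.

```shell
# Illustrative weights only; 50 is the value used in the example above.
for w in 25 50 100; do
    out="holmes.w$w.align"   # hypothetical output file name, one per weight
    cmd="cwb-align -P lemma -o $out HOLMES-EN HOLMES-DE s -W:$w:lex.txt"
    echo "$cmd"              # dry run: print the command instead of running it
    # eval "$cmd"            # uncomment to actually run the aligner
done > align_commands.txt

cat align_commands.txt
```

Writing each run to its own file keeps the alignments side by side for comparison.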

There are many other parameters that can be tweaked and it may be worth experimenting to see what gives you the best results. We won't cover the details here. All are described in full in the cwb-align manual file (accessed by the command man cwb-align on Unix, provided as a separate file on Windows).

One thing worth noting, however, is that it is possible to use pre-alignment. “Pre-alignment” means that some correspondences are known in advance. In a novel, for instance, there may be chapter boundaries which match across translations, and we can say for certain that a sentence in chapter 1 in language A will not be aligned with a sentence in any other chapter than chapter 1 in language B. This makes the aligner's task easier.

If the indexed corpora contain such pre-alignment information encoded as an s-attribute, then the aligner can be instructed to use it.

In the HOLMES corpora there exist paragraphs (s-attribute p). Let us assume that these paragraphs are pre-aligned: we know that each paragraph in HOLMES-EN matches one and only one paragraph in HOLMES-DE, and we know which one. Only the alignment of sentences within each paragraph pair remains to be found.

In this case we can add either the -S or -V option to cwb-align.

If we specify paragraph pre-alignment with -S, then the aligner assumes that the source and target corpora have the same number of paragraphs, and that the first paragraph in the source (HOLMES-EN) corresponds to the first paragraph in the target (HOLMES-DE), the second to the second, and so on. This would be done as follows:

$ cwb-align -S p -o holmes.align HOLMES-EN HOLMES-DE s
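Since -S relies on both corpora having the same number of paragraphs, it may be sensible to compare the counts beforehand. The sketch below assumes that cwb-s-decode is installed, that it prints one line per region of the given s-attribute, and that both corpora are in the default registry. The helper check_counts is a hypothetical name introduced here for illustration.

```shell
# Compare two region counts and report whether pairing by order is plausible.
check_counts() {
    if [ "$1" -eq "$2" ]; then
        echo "ok: $1 regions in both corpora"
    else
        echo "mismatch: $1 vs $2 regions"
        return 1
    fi
}

# Only attempted if CWB is installed. Assumption: cwb-s-decode prints
# one line per <p> region, so counting lines counts paragraphs.
if command -v cwb-s-decode >/dev/null 2>&1; then
    n_en=$(cwb-s-decode HOLMES-EN -S p | wc -l | tr -d ' ')
    n_de=$(cwb-s-decode HOLMES-DE -S p | wc -l | tr -d ' ')
    check_counts "$n_en" "$n_de"
fi
```

If the counts differ, matching paragraphs by order cannot be right for every paragraph.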

Alternatively we can use -V. In this case, paragraphs will not be matched up by order - rather, they are matched up by the value of the s-attribute. Since the Holmes corpora have num as an annotation on <p>, there is an s-attribute p_num which has values and can be used in this way. This is done as follows:

$ cwb-align -V p_num -o holmes.align HOLMES-EN HOLMES-DE s

In this case, the order of the paragraphs does not matter: the aligner will always try to match sentences in paragraph 3 in one corpus to sentences in paragraph 3 in the other corpus.
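The difference between matching by order and matching by value can be illustrated outside CWB with two small text files. This is only an analogy using standard Unix tools, not a description of how cwb-align works internally; the file names and paragraph labels are made up for the example.

```shell
# Hypothetical paragraph lists: "<num> <label>", one paragraph per line.
# The German file lists paragraphs 2 and 3 in swapped order.
printf '1 EN-one\n2 EN-two\n3 EN-three\n' > en_paras.txt
printf '1 DE-one\n3 DE-three\n2 DE-two\n' > de_paras.txt

# Pairing by order (the idea behind -S): n-th line with n-th line.
by_order=$(paste -d' ' en_paras.txt de_paras.txt | awk '{ print $2 "<->" $4 }')

# Pairing by value (the idea behind -V): join the files on the num field.
sort -k1,1 en_paras.txt > en_sorted.txt
sort -k1,1 de_paras.txt > de_sorted.txt
by_value=$(join en_sorted.txt de_sorted.txt | awk '{ print $2 "<->" $3 }')

echo "by order:"; echo "$by_order"
echo "by value:"; echo "$by_value"
```

Pairing by order puts EN-two together with DE-three, whereas pairing by value recovers EN-two with DE-two despite the different ordering.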

Using pre-alignment generally improves the output: a sentence can only be aligned within its own pre-aligned region, so far fewer possibilities have to be checked for each sentence.