The cwb-align program is a very simple text aligner. It can be considered a “fallback” option for sentence alignment, designed to provide basic functionality when nothing better is available. If your corpus is already aligned, it is always better to use that existing alignment data. Similarly, if you have a properly designed and trained aligner for a given language pair, it is always better to use that than to rely on cwb-align.
In particular, cwb-align will not work well for language pairs that share little or no vocabulary, because it works by looking for similarities between the words used in the two corpora it analyses.
cwb-align uses very basic techniques to align units in two parallel corpora by spotting those units (assumed to be of about sentence length) that have similar content. It looks for similarities in terms of:
Here is how we might create an alignment from scratch and then encode it using the two HOLMES corpora, assuming that the <s> elements are the units to be aligned.
The most basic use of cwb-align would be as follows:
$ cwb-align -o holmes.align HOLMES-EN HOLMES-DE s
This command has one option and three arguments. The -o option simply specifies a filename for the output data. The first and second arguments are the labels of the source corpus and the target corpus respectively. The third argument is the grid attribute, that is, the s-attribute used as the alignment grid.
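The anatomy of this command can be spelled out in a small shell sketch. The function name build_align_cmd and its variable names are illustrative assumptions and not part of CWB; only the -o option and the order of the three arguments are taken from the description above. The function merely prints the command line, so it can be checked before anything is run.

```shell
# Hypothetical helper (not part of CWB): assemble the cwb-align call
# described above and print it without running it.
# Usage: build_align_cmd OUTPUT SOURCE TARGET GRID
build_align_cmd() {
    out=$1            # file name passed to -o
    source_corpus=$2  # label of the source corpus
    target_corpus=$3  # label of the target corpus
    grid=$4           # s-attribute used as the alignment grid
    printf 'cwb-align -o %s %s %s %s\n' \
        "$out" "$source_corpus" "$target_corpus" "$grid"
}

# Prints the same command line as the example above.
build_align_cmd holmes.align HOLMES-EN HOLMES-DE s
```

Swapping the second and third arguments would reverse the roles of source and target corpus.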
The output file has five columns (see figure 8). The first line is a header line with the names of the aligned corpora and of the grid attribute. Each subsequent line specifies a pair of aligned regions:
However, it is not normally necessary for a human being to read the file. Usually it is used only as input data for the next step (see below).
To check whether the aligner worked correctly, you can view this file interactively using the cwb-align-show program. The command to run this program is:
$ cwb-align-show holmes.align
You can use the -w option for a wider display if your terminal window is big enough.
Press Return to display the next alignment pair, h for other key commands, and q to exit the viewer.
If your parallel corpus is large, it may be advisable to compress the .align file by specifying a filename with extension .gz, .bz2, or .xz. All CWB alignment tools handle such compressed files transparently.
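The CWB tools decompress such files themselves, but ordinary Unix tools such as head do not. For a quick look at a compressed alignment file outside CWB, a helper along the following lines may be useful. This is a minimal sketch: the function name cat_align is an assumption, not a CWB program, and it relies on the standard gzip, bzip2 and xz utilities being installed.

```shell
# Hypothetical helper (not part of CWB): write an alignment file to
# standard output, decompressing it first if its extension requires it.
cat_align() {
    case "$1" in
        *.gz)  gzip -dc "$1" ;;
        *.bz2) bzip2 -dc "$1" ;;
        *.xz)  xz -dc "$1" ;;
        *)     cat "$1" ;;
    esac
}

# Example: show the header line of a compressed alignment file,
# if one exists in the current directory.
if [ -f holmes.align.gz ]; then
    cat_align holmes.align.gz | head -n 1
fi
```

The helper chooses the decompressor from the file extension alone, mirroring the three extensions listed above; it does not inspect the file contents.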