9.5 Importing a pre-existing alignment

What if your corpora have already been aligned, either manually or using a better aligner than cwb-align? In this case, you can create an a-attribute by importing such existing alignment information with cwb-align-import.

cwb-align-import is not part of the main CWB core, but is instead one of the CWB/Perl tools.

The procedure to import an alignment from existing information is as follows.

First, you must encode your information into an alignment beads file. An alignment bead is one single point of alignment between the source and target corpora. An alignment beads file is a file defining a series of beads, plus some header information.

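The header line of a beadfile has four items, separated by tabs: the name of the source corpus, the name of the target corpus, the grid attribute, and the key pattern.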

After that, every line contains a single alignment bead. This consists of one or more source corpus IDs, then a tab, then one or more target corpus IDs. The IDs must follow the key pattern. Let's consider each of these elements in further detail.

The key pattern specifies how ID codes in the beadfile relate to ID codes for regions in your indexed corpora. The ID codes in the beadfile must be unique within each of the corpora.

The most basic case is when we can directly use an indexed s-attribute that has annotation values. For instance, let's assume that the grid attribute is s, and that in both corpora there is a subsidiary s-attribute called s_id which contains codes that uniquely identify the sentences: s_s01, s_s02, s_s03 etc. in the source corpus, and t_s01, t_s02, t_s03 etc. in the target corpus. In that case, we can simply use s_id on its own as the key pattern. That causes the IDs on the other lines of the beadfile to be matched against the contents of s_id. To accomplish this, we specify the key pattern simply as “id” within curly brackets, that is, {id}.

Lines after the header must each contain a single alignment bead. A bead consists of two columns, separated by a tab. Each column contains one or more space-separated IDs (in our example, from the s_id attribute) associated with regions which are to be treated as aligned to one another. The IDs in the first column relate to the source corpus, and the IDs in the second column relate to the target corpus.
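If your existing alignment is held in some other form, a beadfile in the layout just described can be generated with a short script. The following Python sketch is only an illustration: the function name format_beadfile, the corpus names SOURCE and TARGET, and the sentence IDs are all hypothetical, and the header order (source corpus, target corpus, grid attribute, key pattern) is assumed from the example beadfile in this section.

```python
def format_beadfile(source, target, grid, key_pattern, beads):
    """Return beadfile text: one header line, then one bead per line.

    beads is a list of (source_ids, target_ids) pairs, each a list of strings.
    """
    # Header line: four items separated by tabs.
    lines = ["\t".join([source, target, grid, key_pattern])]
    for source_ids, target_ids in beads:
        # IDs within a column are separated by spaces;
        # the two columns are separated by a tab.
        lines.append(" ".join(source_ids) + "\t" + " ".join(target_ids))
    return "\n".join(lines) + "\n"


# Hypothetical corpus names and sentence IDs, for illustration only.
beads = [
    (["s_s01", "s_s02"], ["t_s01"]),
    (["s_s03"], ["t_s02"]),
    (["s_s04"], ["t_s03", "t_s04"]),
]
text = format_beadfile("SOURCE", "TARGET", "s", "{id}", beads)
print(text, end="")
```

Writing the tabs explicitly in code avoids a common problem with hand-edited beadfiles, where an editor silently converts tabs into spaces.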

The overall beadfile for our hypothetical would then look something like this:

CORPUS-SL    NAME-TL    s    {id}
s_s01 s_s02    t_s01
s_s03    t_s02
s_s04    t_s03 t_s04
(...)

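This specifies that source sentences s_s01 and s_s02 together are aligned to target sentence t_s01; that source sentence s_s03 is aligned to target sentence t_s02; and that source sentence s_s04 is aligned to target sentences t_s03 and t_s04 together.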

Such a beadfile can then be imported, creating the a-attribute, using the following command:

$ cwb-align-import -p beadfile.txt

The -p option, short for prune, should normally be used. It makes cwb-align-import ignore any beads with one or more IDs that don't actually occur in the corpus. Without this option, cwb-align-import will abort with an error message if it encounters any bad IDs.
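To get a feel for what pruning does, it can help to check a set of beads against the IDs you know to exist before running the import. The Python sketch below is a conceptual pre-check only, not the code used by cwb-align-import: the function prune_beads and the sets of known IDs are hypothetical, and in practice the known IDs would have to be extracted from your indexed corpora by some other means.

```python
def prune_beads(beads, source_ids, target_ids):
    """Split beads into those whose IDs all exist and those with a bad ID."""
    kept, dropped = [], []
    for src, tgt in beads:
        # A bead is kept only if every ID on both sides is known.
        ok = all(i in source_ids for i in src) and all(i in target_ids for i in tgt)
        (kept if ok else dropped).append((src, tgt))
    return kept, dropped


# Hypothetical sets of IDs that actually occur in each corpus.
known_source = {"s_s01", "s_s02", "s_s03"}
known_target = {"t_s01", "t_s02"}

beads = [
    (["s_s01", "s_s02"], ["t_s01"]),
    (["s_s03"], ["t_s02"]),
    (["s_s04"], ["t_s03", "t_s04"]),  # these IDs are not in the known sets
]
kept, dropped = prune_beads(beads, known_source, known_target)
print(len(kept), "beads kept;", len(dropped), "beads dropped")
```

A long list of dropped beads usually points to a mismatch between the key pattern and the IDs in the beadfile, which is worth investigating before relying on -p to hide it.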

It is not necessary to specify in the cwb-align-import command which corpora are being aligned, because that information is on the header line of the beadfile. However, both corpora do need to exist in the active corpus registry (however that is specified).

In our bilingual Holmes corpus, things are a little more complicated. The sentences do have ID codes, so there is an s-attribute called s_id. However, the values of this attribute are not unique. Each paragraph re-uses the same ID codes, starting with a, then b, then c, and so on. So just “id” in the key pattern will not work.

We can make the codes unique by combining s-attributes together. Each paragraph in Holmes is numbered, in the s-attribute called p_num. Therefore, if we combine together the paragraph number and the sentence ID, we will have unique identifiers.

We do this by specifying both attributes in the key pattern, each within braces. We must specify their names in full, as p_num and s_id, unlike the simple case (where id is automatically expanded to s_id on the basis of the grid attribute). A key pattern can also contain constant elements around the s-attributes enclosed in curly braces, to allow the IDs to be friendlier in appearance.

An actual beadfile for the Holmes corpora is provided as part of the data package. Its filename is holmes_en_de_align.txt, and its format is shown in figure 10. The key pattern in this beadfile consists of a constant s followed by the two s-attributes, p_num and s_id. This means that the leading “s” in each key in the rest of the file will be ignored, and the remainder looked up against p_num and s_id. The key s1a therefore matches paragraph 1, sentence a.

(For further examples of complex key patterns, see man cwb-align-import.)
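One way to think about a key pattern is as a template that is filled in with the annotation values of each region to produce that region's key. The Python sketch below illustrates this idea under stated assumptions: build_key is a hypothetical helper, not part of CWB, the pattern is written out as the text above describes it, and the expansion of {id} to the grid attribute plus _id is modelled in the simplest possible way. The real tool may work differently internally.

```python
import re


def build_key(key_pattern, grid, annotations):
    """Fill a key pattern with a region's annotation values.

    annotations maps s-attribute names (e.g. "p_num") to their values.
    """
    def substitute(match):
        name = match.group(1)
        if name in annotations:
            return annotations[name]
        # Assumed shorthand: with grid attribute s, {id} refers to s_id.
        return annotations[grid + "_" + name]

    # Text outside the braces (such as a leading "s") is kept as a constant.
    return re.sub(r"\{(\w+)\}", substitute, key_pattern)


# Paragraph 1, sentence a, with the Holmes-style pattern.
holmes_key = build_key("s{p_num}{s_id}", "s", {"p_num": "1", "s_id": "a"})
# The simple case, where the sentence ID alone is already unique.
simple_key = build_key("{id}", "s", {"s_id": "s_s01"})
print(holmes_key)
print(simple_key)
```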

Figure 10: Beadfile holmes_en_de_align.txt for the Holmes corpus pair
HOLMES-EN HOLMES-DE s s{p_num}{s_i...
...8a s8a
s9a s9b s9a
s9b
s9c s9c
... ...

The Holmes beadfile includes beads which express empty alignments, that is, correspondences between a region in one corpus and nothing in the other corpus. Region s7a in the English corpus corresponds to nothing in the German corpus; likewise region s9b in the German corpus corresponds to nothing in the English corpus. These kinds of alignments aren't possible in a CWB a-attribute, and by default will cause a fatal error. The flag -e (“empty”) tells cwb-align-import to ignore these lines instead, so -e is needed when importing holmes_en_de_align.txt. Fortunately, -e is automatically activated by the -p flag.

The overall command is thus:

$ cwb-align-import -p holmes_en_de_align.txt
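Before importing, you may want to know how many beads in a file are empty alignments and will therefore be skipped. The Python sketch below shows one way to check this. It rests on an assumption that should be verified against your own file: that an empty alignment is written as a line whose source or target column is empty, with the tab still present. The function split_empty_beads and the sample lines are hypothetical.

```python
def split_empty_beads(lines):
    """Separate complete beads from beads with an empty side."""
    complete, empty = [], []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 2:
            raise ValueError("expected two tab-separated columns: %r" % line)
        src = columns[0].split()  # source IDs, space-separated
        tgt = columns[1].split()  # target IDs, space-separated
        if src and tgt:
            complete.append((src, tgt))
        else:
            empty.append((src, tgt))
    return complete, empty


# Hypothetical bead lines (header line already removed).
sample = ["s8a\ts8a", "s9a s9b\ts9a", "\ts9b", "s9c\ts9c"]
complete, empty = split_empty_beads(sample)
print(len(complete), "complete beads;", len(empty), "empty alignments")
```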

Another flag that is often useful is -i. This inverts the source corpus and target corpus from what is declared in the beadfile. It means that you can create two a-attributes, one going each way, from the same beadfile. The command above creates an a-attribute in HOLMES-EN, pointing at HOLMES-DE. To create an a-attribute in HOLMES-DE, pointing at HOLMES-EN, you can use the following command:

$ cwb-align-import -i -p holmes_en_de_align.txt
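Conceptually, inverting resembles importing a second beadfile in which the two corpus names and the two columns of every bead have been swapped. The Python sketch below illustrates that idea only; invert_beadfile is a hypothetical function, the sample text is made up, and with -i available there is no need to produce such a file yourself.

```python
def invert_beadfile(text):
    """Swap source and target in the header and in every bead line."""
    header, *bead_lines = text.splitlines()
    source, target, grid, key_pattern = header.split("\t")
    # The grid attribute and key pattern stay the same; only the roles swap.
    out = ["\t".join([target, source, grid, key_pattern])]
    for line in bead_lines:
        src, tgt = line.split("\t")
        out.append(tgt + "\t" + src)
    return "\n".join(out) + "\n"


# Hypothetical beadfile content, for illustration only.
original = "SOURCE\tTARGET\ts\t{id}\ns_s01 s_s02\tt_s01\ns_s03\tt_s02\n"
inverted = invert_beadfile(original)
print(inverted, end="")
```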