9 Sentence alignment
An alignment between two parallel corpora
(e.g. a collection of source texts and their translations into some other language)
can be encoded as a corpus attribute within CWB.
- Alignment attributes (a-attributes) are unlike other types of attribute
because alignment presupposes the existence of the source and target corpora.
That is, first we need to encode the two corpora independently;
then we can add the alignment attribute that links them.
- Alignment attributes are usually employed for sentence alignment,
and we will assume throughout that it is sentences that we are aligning.
- However, you can also align at some other level
(e.g. clauses or paragraphs or chapters).
Aligning regions much smaller than a
sentence is unlikely to be useful, because of
limitations in how CQP handles a-attributes.
- Only one a-attribute linking any particular pair of corpora can exist.
- There are two ways that a pair of corpora can be aligned.
- First, the cwb-align tool can be used
to automatically align the sentences of the two corpora,
with its output subsequently encoded as an
a-attribute using cwb-align-encode.
- Second, an existing alignment scheme encoded in
the corpus markup can be imported as an a-attribute using cwb-align-import.
CWB supports many types of alignment link: one-to-one, many-to-many, and crossing.
However, each region that serves as a unit of alignment
must be continuous; a discontinuous region cannot be aligned.
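The continuity restriction can be illustrated with a small sketch. It is not CWB code and does not reproduce CWB's file formats. It assumes, purely for illustration, that each sentence is an inclusive (start, end) pair of token positions and that a link is stored as one span per corpus. The names contiguous_span and make_link are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # inclusive (start, end) token positions of one sentence


def contiguous_span(sentences: List[Region], indices: List[int]) -> Region:
    """Return the token span covering the chosen sentences.

    The sentences must be adjacent (consecutive indices); a selection
    with a gap is a discontinuous region and is rejected.
    """
    ordered = sorted(indices)
    if not ordered:
        raise ValueError("a link needs at least one sentence on each side")
    if ordered != list(range(ordered[0], ordered[-1] + 1)):
        raise ValueError(f"discontinuous region: sentences {ordered}")
    return sentences[ordered[0]][0], sentences[ordered[-1]][1]


def make_link(src_sents: List[Region], tgt_sents: List[Region],
              src_idx: List[int], tgt_idx: List[int]) -> Tuple[int, int, int, int, str]:
    """Build one link as (src_start, src_end, tgt_start, tgt_end, type)."""
    s_start, s_end = contiguous_span(src_sents, src_idx)
    t_start, t_end = contiguous_span(tgt_sents, tgt_idx)
    link_type = f"{len(src_idx)}:{len(tgt_idx)}"  # e.g. "1:1" or "2:3"
    return s_start, s_end, t_start, t_end, link_type


# Hypothetical sentence boundaries in a source and a target corpus.
src = [(0, 4), (5, 9), (10, 17)]
tgt = [(0, 5), (6, 8), (9, 12), (13, 20)]

one_to_one = make_link(src, tgt, [0], [0])
many_to_many = make_link(src, tgt, [1, 2], [1, 2, 3])

try:
    make_link(src, tgt, [0, 2], [0])  # skips source sentence 1
    rejected = False
except ValueError:
    rejected = True
```

Because each side of a link is stored as a single span, a selection that skips a sentence cannot be represented, which is why the sketch rejects it rather than trying to encode it.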