The following steps illustrate the transformation of textual data with some XML markup into the CWB data format.
A schematic representation of the encoded corpus is shown in Figure 1. Each token (together with its annotations) corresponds to a row in the tabular format. The row numbers, starting from 0, uniquely identify each token and are referred to as corpus positions.
Each (token-level) annotation layer corresponds to a column in the table, called a positional attribute or p-attribute (note that the original word forms are also treated as an attribute with the special name word). Annotations are always interpreted as character strings, which are collected in a separate lexicon for each positional attribute. The CWB data format uses lexicon IDs for compact storage and fast access.
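The lexicon idea can be illustrated with a minimal sketch. It is not the actual CWB binary format: the function name encode_p_attribute is hypothetical, and numbering IDs in order of first occurrence is an assumption made for the illustration.

```python
def encode_p_attribute(values):
    """Map annotation strings to integer lexicon IDs (illustrative only)."""
    lexicon = {}  # annotation string -> lexicon ID
    ids = []      # one lexicon ID per corpus position
    for value in values:
        if value not in lexicon:
            # Assumption: IDs are assigned in order of first occurrence.
            lexicon[value] = len(lexicon)
        ids.append(lexicon[value])
    return lexicon, ids


# The list index of each token plays the role of its corpus position.
tokens = ["the", "man", "with", "the", "telescope"]
word_lexicon, word_ids = encode_p_attribute(tokens)
```

Each distinct string is stored once in the lexicon, and the token sequence itself reduces to a list of small integers, which is what makes storage compact and lookup fast.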
Matching pairs of XML start and end tags are encoded as token regions, identified by the corpus positions of the first token (immediately following the start tag) and the last token (immediately preceding the end tag) of the region. (Note how the corpus position of an XML tag in Figure 1 is identical to that of the following or preceding token, respectively.) Elements of the same name (e.g. <s>...</s> or <text>...</text>) are collected and referred to as a structural attribute or s-attribute. The corresponding regions must be non-overlapping and non-recursive. Different s-attributes are completely independent in the CWB: a hierarchical nesting of the XML elements is neither required nor can it be guaranteed.
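The mapping from tags to token regions can be sketched as follows. This is a simplified illustration, not the encoder's implementation: it assumes one token or one tag per input line, and the helper name encode_regions and the sample input are hypothetical.

```python
import re

# A line consisting of a single XML start or end tag.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w-]*)([^>]*)>$")


def encode_regions(lines):
    """Return (tokens, regions); regions maps each element name to a
    list of (first, last) corpus positions."""
    tokens, regions = [], {}
    start, depth = {}, {}
    for line in lines:
        match = TAG.match(line)
        if match is None:
            tokens.append(line)  # an ordinary token gets the next position
            continue
        closing, name = match.group(1), match.group(2)
        if not closing:
            if depth.get(name, 0) == 0:
                start[name] = len(tokens)  # position of the following token
            depth[name] = depth.get(name, 0) + 1
        elif depth.get(name, 0) > 0:
            depth[name] -= 1
            if depth[name] == 0:
                first, last = start.pop(name), len(tokens) - 1
                if last >= first:  # skip regions that contain no tokens
                    regions.setdefault(name, []).append((first, last))
    return tokens, regions


lines = ['<text id="42" lang="English">', "<s>", "An", "easy", "example", ".",
         "</s>", "<s>", "Another", "sentence", "</s>", "</text>"]
tokens, regions = encode_regions(lines)
```

Tags never consume a corpus position of their own. Each element name is tracked separately, which mirrors the independence of s-attributes, and a region opened inside another region of the same name is silently ignored here.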
Key-value pairs in XML start tags can be stored as an annotation of the corresponding s-attribute region. All key-value pairs are treated as a single character string, which has to be “parsed” by a CQP query that needs access to individual values. In the recommended encoding procedure, an additional s-attribute (named element_key) is automatically created for each key and is directly annotated with the corresponding value (cf. <text_id> and <text_lang> in Figure 1).
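The element_key naming scheme can be sketched in a few lines. The helper split_attributes and the sample values are hypothetical, and the sketch assumes that every value is enclosed in double quotes.

```python
import re

# One key="value" pair inside an XML start tag.
ATTR = re.compile(r'([A-Za-z_][\w-]*)="([^"]*)"')


def split_attributes(element, attribute_string):
    """Turn the raw attribute string of one region into a mapping from
    derived s-attribute names (element_key) to their values."""
    return {
        f"{element}_{key}": value
        for key, value in ATTR.findall(attribute_string)
    }


annotations = split_attributes("text", 'id="42" lang="English"')
```

With the derived attributes in place, a query can read a single value directly and does not have to pick apart the combined string.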
Since s-attributes are non-recursive, XML markup such as
<np>the man <pp>with <np>the telescope</np></pp> </np>
is not allowed in a CWB corpus (the embedded <np> region will automatically be dropped). In the recommended encoding procedure, embedded regions (up to a pre-defined level of embedding) are automatically renamed by adding digits to the element name:
<np>the man <pp>with <np1>the telescope</np1></pp> </np>
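The renaming scheme can be sketched as a small tag rewriter. This is an illustration under stated assumptions, not the encoder's own code: the function rename_nested and its max_depth parameter are hypothetical, the default of 2 is arbitrary, and the input is assumed to be well-formed.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z_][\w-]*)([^>]*)>")


def rename_nested(markup, max_depth=2):
    """Append the embedding level to the names of nested elements.

    Tags nested more deeply than max_depth are removed (their text stays).
    """
    depth = {}  # element name -> number of currently open elements

    def rewrite(match):
        closing, name, rest = match.groups()
        if closing:
            if depth.get(name, 0) == 0:
                return match.group(0)  # stray end tag: leave untouched
            depth[name] -= 1
            level = depth[name]
        else:
            level = depth.get(name, 0)
            depth[name] = level + 1
        if level == 0:
            return match.group(0)      # outermost element keeps its name
        if level > max_depth:
            return ""                  # beyond the pre-defined level
        return f"<{closing}{name}{level}{rest}>"

    return TAG.sub(rewrite, markup)


example = "<np>the man <pp>with <np>the telescope</np></pp> </np>"
renamed = rename_nested(example)
```

Applied to the example above, the sketch keeps the outer <np> and <pp> tags and renames only the embedded noun phrase to <np1>.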