1.2 The CWB corpus data model

The following steps illustrate the transformation of textual data with some XML markup into the CWB data format.

  1. Formatted text (as displayed on-screen or printed)

    An easy example. Another very easy example. Only the easiest examples!
  2. Text with XML markup (at the level of texts, words or characters)

    <text id=42 lang="English"> <s>An easy example.</s><s> Another <i>very</i> easy example.</s> <s><b>O</b>nly the <b>ea</b>siest ex<b>a</b>mples!</s> </text>
  3. Tokenised text (character-level markup has to be removed)

    <text id=42 lang="English"> <s> An easy example . </s> <s> Another very easy example . </s> <s> Only the easiest examples ! </s> </text>
  4. Text with linguistic annotations (annotations are added at token level)

    <text id=42 lang="English"> <s> An/DET/a easy/ADJ/easy example/NN/example ./PUN/. </s> <s> Another/DET/another very/ADV/very easy/ADJ/easy example/NN/example ./PUN/. </s> <s> Only/ADV/only the/DET/the easiest/ADJ/easy examples/NN/example !/PUN/! </s> </text>
  5. Text encoded as CWB corpus (tabular format, similar to relational database)

    A schematic representation of the encoded corpus is shown in Figure 1. Each token (together with its annotations) corresponds to a row in the tabular format. The row numbers, starting from 0, uniquely identify each token and are referred to as corpus positions.

    Each (token-level) annotation layer corresponds to a column in the table, called a positional attribute or p-attribute (note that the original word forms are also treated as an attribute with the special name word). Annotations are always interpreted as character strings, which are collected in a separate lexicon for each positional attribute. The CWB data format uses lexicon IDs for compact storage and fast access.
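
    The tabular layout and the per-attribute lexicons can be sketched in a few lines of Python. This is an illustration of the data model only, not of the CWB on-disk format: the function names encode_p_attribute and value_at are hypothetical, and assigning lexicon IDs in order of first occurrence is an assumption made for the sketch.

```python
# Annotated tokens from step 4 as (word, pos, lemma) triples.
# The list index of each triple plays the role of the corpus position.
tokens = [
    ("An", "DET", "a"), ("easy", "ADJ", "easy"),
    ("example", "NN", "example"), (".", "PUN", "."),
    ("Another", "DET", "another"), ("very", "ADV", "very"),
    ("easy", "ADJ", "easy"), ("example", "NN", "example"), (".", "PUN", "."),
    ("Only", "ADV", "only"), ("the", "DET", "the"),
    ("easiest", "ADJ", "easy"), ("examples", "NN", "example"), ("!", "PUN", "!"),
]


def encode_p_attribute(values):
    """Return (lexicon, ids) for one column of the table.

    lexicon holds each distinct string once (order of first occurrence,
    an assumption of this sketch); ids[cpos] is the lexicon ID stored
    for corpus position cpos.
    """
    lexicon, index, ids = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(lexicon)
            lexicon.append(value)
        ids.append(index[value])
    return lexicon, ids


# One (lexicon, ids) pair per positional attribute.
names = ("word", "pos", "lemma")
p_attributes = {
    name: encode_p_attribute([token[column] for token in tokens])
    for column, name in enumerate(names)
}


def value_at(attribute, cpos):
    """Look up the string value of a p-attribute at a corpus position."""
    lexicon, ids = p_attributes[attribute]
    return lexicon[ids[cpos]]
```

    With this layout, value_at("lemma", 11) looks up the lemma of the token easiest, and the pos column needs only five lexicon entries for all 14 tokens, which is where the compact storage comes from.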

    Matching pairs of XML start and end tags are encoded as token regions, identified by the corpus positions of the first token (immediately following the start tag) and the last token (immediately preceding the end tag) of the region. (Note how the corpus position of an XML tag in Figure 1 is identical to that of the following or preceding token, respectively.) Elements of the same name (e.g. <s>...</s> or <text>...</text>) are collected and referred to as a structural attribute or s-attribute. The corresponding regions must be non-overlapping and non-recursive. Different s-attributes are completely independent in the CWB: a hierarchical nesting of the XML elements is neither required nor can it be guaranteed.
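
    How regions follow from the tokenised text can be sketched as below. The function collect_regions is a hypothetical helper, not part of the CWB; it assumes whitespace-separated input, tags without key-value pairs (so the <text> tag is simplified here), and non-recursive elements as required above.

```python
import re

# Tokenised text from step 3, with the attributes of <text> left out
# so that every tag is a single whitespace-separated item.
stream = (
    "<text> <s> An easy example . </s> "
    "<s> Another very easy example . </s> "
    "<s> Only the easiest examples ! </s> </text>"
).split()


def collect_regions(items):
    """Map each element name to a list of (first, last) corpus positions."""
    regions = {}     # element name -> list of (start, end) pairs
    open_start = {}  # element name -> start of the currently open region
    cpos = 0         # corpus position of the next token
    for item in items:
        tag = re.fullmatch(r"<(/?)(\w+)>", item)
        if tag is None:
            cpos += 1  # ordinary token: occupies one corpus position
        elif tag.group(1) == "":
            # Start tag: the region begins at the following token.
            open_start[tag.group(2)] = cpos
        else:
            # End tag: the region ends at the preceding token.
            start = open_start.pop(tag.group(2))
            regions.setdefault(tag.group(2), []).append((start, cpos - 1))
    return regions


regions = collect_regions(stream)
```

    Each element name yields its own list of regions, independent of the others: the three <s> regions and the single <text> region are stored separately, and nothing in the data records that the sentences lie inside the text.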

    Key-value pairs in XML start tags can be stored as an annotation of the corresponding s-attribute region. All key-value pairs are treated as a single character string, which has to be “parsed” by a CQP query that needs access to individual values. In the recommended encoding procedure, an additional s-attribute (named element_key) is automatically created for each key and is directly annotated with the corresponding value (cf. <text_id> and <text_lang> in Figure 1).
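
    The two views of a start tag, one combined string versus one s-attribute per key, can be illustrated as follows. The helper split_start_tag is hypothetical and its regular expression is a simplification for this example (it handles only unquoted values and double-quoted values); it is not the parser used by the CWB tools.

```python
import re


def split_start_tag(tag):
    """Split a start tag into element name, raw annotation string,
    and a dict of derived element_key attributes."""
    name, _, raw = tag.strip("<>").partition(" ")
    # A value is either a double-quoted string or a run of non-blanks.
    pairs = re.findall(r'(\w+)=("[^"]*"|\S+)', raw)
    derived = {f"{name}_{key}": value.strip('"') for key, value in pairs}
    return name, raw, derived


name, raw, derived = split_start_tag('<text id=42 lang="English">')
# raw is the single string annotating the <text> region;
# derived holds one entry per additional s-attribute.
```

    Here raw is the single character string that a query would otherwise have to take apart, while derived maps text_id and text_lang directly to their values.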

  6. Recursive XML markup (can be automatically renamed)

    Since s-attributes are non-recursive, XML markup such as

    <np>the man <pp>with <np>the telescope</np></pp> </np>
    is not allowed in a CWB corpus (the embedded <np> region will automatically be dropped). In the recommended encoding procedure, embedded regions (up to a pre-defined level of embedding) are automatically renamed by adding digits to the element name:
    <np>the man <pp>with <np1>the telescope</np1></pp> </np>
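
    The renaming scheme can be sketched with a depth counter per element name. The function rename_embedded and its max_depth parameter are hypothetical names chosen for this illustration; the sketch assumes well-formed input and tags without key-value pairs, and it drops tags nested deeper than max_depth, mirroring the behaviour described above for regions that are not renamed.

```python
import re


def rename_embedded(markup, max_depth=2):
    """Append the embedding level to the names of nested elements."""
    depth = {}  # element name -> number of currently open elements

    def rename(match):
        closing, name = match.group(1), match.group(2)
        if closing:
            depth[name] -= 1
            level = depth[name]
        else:
            level = depth.get(name, 0)
            depth[name] = level + 1
        if level == 0:
            return match.group(0)  # outermost element keeps its name
        if level > max_depth:
            return ""              # too deeply embedded: tag is dropped
        return f"<{closing}{name}{level}>"

    return re.sub(r"<(/?)([A-Za-z_]+)>", rename, markup)


renamed = rename_embedded(
    "<np>the man <pp>with <np>the telescope</np></pp> </np>"
)
```

    After renaming, <np> and <np1> are two distinct s-attributes, each of which is non-recursive on its own.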

Figure 1: Sample text encoded as a CWB corpus.