5 CWB corpora and XML

Nowadays, machine-readable text and linguistic annotations are often provided in XML format. CWB's XML support is activated by the following encoding options: -x for XML compatibility mode (recognises default entities and skips comments as well as an XML declaration), -s to skip blank lines in the input, and -B to strip whitespace from tokens. All three options -xsB should (almost) always be used. The vertical text format with TAB-separated p-attributes is still required by cwb-encode, but this format can easily be generated from an arbitrary XML file with a short script in any suitable language. Figure 3 shows a typical example of an XML input file for the CWB, including an XML declaration and a comment line that the options -xsB will cause to be ignored. Note that despite the use of tabs for columns, this is still a well-formed XML file.

Figure 3: Verticalized XML file vss.vrt
<?xml version="1.0" encoding="I...
...VB tick
. SENT .
</s>
</p>
...
</story>
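The conversion script mentioned above can be very small. The following is a minimal sketch, not part of CWB: the function name verticalize is hypothetical, tokens are split on whitespace only, and only the word form is emitted (no pos or lemma columns, which would normally come from a tagger). It also assumes that every tag is contained in a single input line and that the characters < and > occur only as tag delimiters.

```shell
# Hypothetical helper (not part of CWB): put every XML tag and every
# whitespace-separated token on a line of its own.
verticalize() {
  awk '{
    # isolate tags by surrounding them with line breaks
    gsub(/</, "\n<")
    gsub(/>/, ">\n")
    n = split($0, parts, "\n")
    for (i = 1; i <= n; i++) {
      if (parts[i] ~ /^</) {
        # a tag (or comment / XML declaration): keep it intact
        print parts[i]
      } else {
        # running text: one token per line
        m = split(parts[i], toks, " ")
        for (j = 1; j <= m; j++) print toks[j]
      }
    }
  }'
}
```

It reads XML on standard input and writes the verticalized text to standard output, for example verticalize < input.xml > output.vrt (both file names are placeholders).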

XML elements (i.e. matching pairs of start and end tags) can be encoded as s-attributes, which have to be declared with -S flags (for the file vss.vrt, the flags -S story -S p -S s would be used). If XML regions of the same type are nested, encoding will only work correctly if you add :0 to the s-attribute declaration, which enables a rudimentary XML parser built into cwb-encode. Attribute-value pairs in XML start tags, such as

<story num="4" title="A Thrilling Experience">
can be stored as a single unparsed text string (num="4" title="A Thrilling Experience") by using the flag -V instead of -S. This form of encoding is not convenient for CQP queries, though. It is more desirable to declare XML tag attributes explicitly; doing so will automatically split the XML elements into multiple s-attributes.
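As a sketch of such an explicit declaration, the call below lists the XML attributes of <story> after the nesting level, separated by + signs. The data and registry paths, the charset and the p-attributes are taken from the other examples in this section and are assumptions about your installation; the function name encode_vss is hypothetical and only groups the command.

```shell
# Hypothetical wrapper around cwb-encode; adjust the paths to your setup.
encode_vss() {
  # -S s:0 and -S p:0      : plain regions, nested duplicates are dropped
  # -S story:0+num+title   : also store the num and title attribute values
  cwb-encode -d /corpora/data/vss -f vss.vrt \
             -R /usr/local/share/cwb/registry/vss \
             -xsB -c ascii -P pos -P lemma \
             -S s:0 -S p:0 -S story:0+num+title
}
```

Calling encode_vss in the directory that contains vss.vrt runs the encoder.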

These commands will encode the corpus VSS and create a registry file, including the s-attributes s, p, story, story_num, and story_title. The <story> start tags are parsed and the attribute values are stored as annotations of the attributes story_num (value: 4) and story_title (value: A Thrilling Experience). Regions of the story attribute itself will not be annotated. Use -V instead of -S to store all attribute-value pairs as a single string, which can be useful for displaying and re-exporting the XML tags.

XML elements with different names (such as <s> and <p>) are encoded independently, so they can nest and overlap in arbitrary ways. The cwb-encode program does not perform any validation or well-formedness tests on the XML elements. When elements are nested recursively (e.g. a <table> within a <table>), the embedded elements will be ignored, because of the use of :0 specified above. After encoding, cwb-encode prints a summary listing the number of dropped XML elements. If you instead want to preserve nested elements, you can specify a maximal level of embedding instead of :0 in the examples above. For instance, -S table:2 allows two levels of embedding for <table> elements. Nested elements are automatically renamed to <table1> and <table2>, respectively, and stored in separate s-attributes.
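The effect of the embedding level can be illustrated with a toy filter that mimics the renaming described above. It is not how cwb-encode is implemented and is meant for experimenting with small inputs only; the function name rename_nested is hypothetical, and the sketch assumes that every tag stands on a line of its own.

```shell
# Illustration only: rename nested <el> tags to <el1>, <el2>, ... up to the
# given maximum embedding level and drop tags that are nested more deeply.
# Usage: rename_nested ELEMENT MAXLEVEL < input.vrt
rename_nested() {
  awk -v el="$1" -v max="$2" '
    BEGIN { depth = 0; max += 0 }
    # start tag, with or without XML attributes
    index($0, "<" el ">") == 1 || index($0, "<" el " ") == 1 {
      if (depth <= max) {
        suffix = (depth > 0) ? depth : ""
        print "<" el suffix substr($0, length(el) + 2)
      }
      depth++
      next
    }
    # matching end tag
    $0 == "</" el ">" {
      depth--
      if (depth <= max) {
        suffix = (depth > 0) ? depth : ""
        print "</" el suffix ">"
      }
      next
    }
    # tokens and all other tags pass through unchanged
    { print }
  '
}
```

With rename_nested table 2, a <table> inside a <table> comes out as <table1>, the next level as <table2>, and any deeper <table> tags disappear while their tokens are kept. With level 0, only the outermost tags survive.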

Sometimes, the input data may contain XML tags that should not be encoded in the corpus. For instance, the stories in vss.vrt have to be wrapped in a single root element <collection> in order to obtain a well-formed XML file. Rather than removing such tags during data preparation, you can have cwb-encode filter them out directly. For this purpose, they have to be declared with the flag -0 (digit zero, for “null attribute”) instead of -S or -V. All start and end tags of these elements will be ignored completely. There is no need to add :0 or XML attribute declarations. Note that all XML tags that have not been declared with a -S, -V or -0 flag will be encoded as literal tokens (that is, words, without annotations), accompanied by a warning message.

Starting with CWB 3.4.21, unknown XML tags can automatically be declared as null attributes with the -9 (“auto-null”) option. This is recommended to capture the correct token stream for an input file that contains a large number of XML elements or undocumented ones:

$ cwb-encode -d /corpora/data/vss -f vss.vrt \
             -R /usr/local/share/cwb/registry/vss \
             -xsBC -c ascii -9 -P pos -P lemma

You may have noticed in Figure 3 that the XML file is declared to be in ISO 8859-1 (or Latin-1) encoding rather than the standard UTF-8 format. CWB ignores this, along with the rest of the XML declaration. The charset still needs to be specified on the command line; here, it is -c ascii, since we know there are no non-ASCII characters in this particular file; see also section 2.