7 Adding XML annotations

In order to add XML annotations (e.g. <np> and <pp> tags inserted by a chunk parser) to an existing corpus, the usual strategy is to decode the token stream (and other attributes if necessary) to a temporary file. A chunk parser will often expect <s> and </s> tags marking sentence boundaries.

Decode token stream (word forms) with start and end tags for <s> regions.

$ cwb-decode -C VSS -P word -S s > word_s.vrt

We then run the chunk parser on the temporary file. The chunk parser adds its <np> and <pp> tags to the token stream, creating the file shown in Figure 5. This file is also provided as part of the data package for this manual.

**Figure 5:** Decoded text with chunk annotations (file chunks.vrt)
    <s> <np head="experience"> My expe... ...life </np> </pp> </np> did not ... </s>

It is important that the token stream is left intact when adding XML annotations. In particular, tokens (as well as XML tags) must remain on separate lines and may not be split or combined. As a preliminary check, make sure that the number of tokens in chunks.vrt is equal to the corpus size. On Unix, the grep and wc utilities can be used for this:

$ grep -v '^<' chunks.vrt | wc -l
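The count alone does not reveal whether an individual token was altered. As a slightly stricter check, the token lines of chunks.vrt can be compared with those of the decoded file word_s.vrt. The following is a minimal sketch, not part of the CWB tools: the function names count_tokens and same_tokens are made up for illustration, and the sketch assumes that no token starts with < and that the chunk parser leaves token lines byte-for-byte unchanged.

```shell
# Hypothetical helpers (not CWB commands).
# Assumption: every line that does not start with '<' is a token.

# count_tokens FILE: print the number of token lines in FILE
count_tokens() {
    grep -c -v '^<' "$1"
}

# same_tokens A B: succeed if A and B contain identical token lines
# in the same order (XML tag lines are ignored in both files)
same_tokens() {
    tmp=$(mktemp) || return 2
    grep -v '^<' "$2" > "$tmp"
    grep -v '^<' "$1" | cmp -s - "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

With these definitions loaded, `same_tokens word_s.vrt chunks.vrt && echo "token stream intact"` would confirm that only XML tags were added.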

Now we can use cwb-encode to encode the XML annotations as structural attributes. The start and end points of regions are computed automatically from the token stream. Since we do not want to overwrite the word attribute, we specify -p -, which suppresses the default p-attribute. With no p-attributes declared, all lines in the input file except for the XML tags are ignored. Recall that -0 s (digit zero) instructs cwb-encode to ignore <s> and </s> tags; if they were left undeclared, they would be interpreted as literal tokens and mess up the token stream.

Encode <np> and <pp> regions in chunks.vrt as new s-attributes:

$ cwb-encode -d /corpora/data/vss -f chunks.vrt \
             -p - -0 s -S np:0+head -S pp:0+head

In this example, cwb-encode will issue warnings about nested regions being dropped. As can be seen from Figure 5, <np> (as well as <pp>) regions may be embedded recursively. In order to preserve such nested regions, change the :0 modifier to :2, allowing up to two levels of embedding (separately for each element type, i.e. <np> regions embedded in larger <np> regions, etc.). In general, :$n$ allows up to $n$ levels of embedding. The embedded regions will automatically be renamed to np1, np2, pp1, and pp2, respectively.
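To choose a suitable value of $n$ before encoding, the deepest embedding that actually occurs in the input can be measured. The sketch below is an illustration only: max_embedding is a hypothetical helper, not a CWB command, and it assumes that every tag stands on a line of its own and that start and end tags are properly balanced.

```shell
# max_embedding FILE ELEM: print the deepest embedding level of <ELEM> regions
# in FILE (0 = never nested, 1 = one <ELEM> inside another <ELEM>, ...).
# Hypothetical helper; assumes one tag per line and balanced tags.
max_embedding() {
    awk -v elem="$2" '
        BEGIN {
            open_re  = "^<" elem "[ >]"   # start tag, with or without attributes
            close_re = "^</" elem ">"     # end tag
        }
        $0 ~ open_re  { depth++; if (depth > max) max = depth }
        $0 ~ close_re { if (depth > 0) depth-- }
        END { print (max > 0 ? max - 1 : 0) }
    ' "$1"
}
```

For instance, if `max_embedding chunks.vrt np` prints 2, then -S np:2+head is sufficient to keep all <np> regions; only regions of the same element type are counted, in line with the per-element rule described above.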

Encode chunks.vrt, allowing up to two levels of embedding for <np> and <pp> regions:

$ cwb-encode -d /corpora/data/vss -f chunks.vrt \
             -p - -0 s -S np:2+head -S pp:2+head

The full list of s-attributes created by this command is np, np1, np2, np_head, np_head1, np_head2, pp, pp1, pp2, pp_head, pp_head1, and pp_head2. They all have to be declared in the registry file of the corpus VSS, either by adding the appropriate entries manually, or with the registry editor script:

$ cwb-regedit VSS :add :s np np1 np2 np_head np_head1 np_head2
$ cwb-regedit VSS :add :s pp pp1 pp2 pp_head pp_head1 pp_head2
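When the nesting depth or the set of annotated attributes changes, typing out the attribute names by hand becomes error-prone. The naming pattern listed above can be generated mechanically, as in the following sketch. The helpers with_levels and nested_sattrs are hypothetical and merely reproduce the pattern shown in this section (element name, optional _annotation suffix, then the embedding level).

```shell
# with_levels BASE N: print "BASE BASE1 ... BASEN" without a trailing newline
with_levels() {
    printf '%s' "$1"
    i=1
    while [ "$i" -le "$2" ]; do
        printf ' %s%s' "$1" "$i"
        i=$((i + 1))
    done
}

# nested_sattrs ELEM N [ANNOTATION...]: print all s-attribute names that
# result from a declaration such as -S ELEM:N+ANNOTATION (hypothetical helper)
nested_sattrs() {
    elem=$1
    levels=$2
    shift 2
    with_levels "$elem" "$levels"
    for ann in "$@"; do
        printf ' '
        with_levels "${elem}_${ann}" "$levels"
    done
    printf '\n'
}
```

Here `nested_sattrs np 2 head` prints the six <np> names from the list above, so the first registry command could also be written as `cwb-regedit VSS :add :s $(nested_sattrs np 2 head)`.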

Attribute-value pairs in XML start tags may contain feature sets, just as is possible for p-attributes. For instance, the German chunk parser YAC¹⁴ uses this notation to represent partially disambiguated morphological features of NPs and PPs (see the CQP Query Language Manual for more information and examples). XML tags of the form

    <np agr="|Nom:F:Sg|Acc:F:Sg|" head="Wiese">

might be encoded with the declaration -S np:2+agr/+head, where the slash / indicates that agr values are feature sets. Since head is not followed by a slash, the corresponding values are not treated as feature sets.
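To make the feature set notation concrete: a value such as the agr attribute above consists of alternatives separated by |, with an additional | at the start and at the end. The following sketch splits such a value into its members using standard Unix tools; feature_set_members is a hypothetical helper for illustration and is not part of CWB.

```shell
# feature_set_members VALUE: print one member of the feature set per line
# (hypothetical helper; assumes members themselves contain no '|')
feature_set_members() {
    printf '%s\n' "$1" | tr '|' '\n' | grep -v '^$'
}
```

Called as `feature_set_members '|Nom:F:Sg|Acc:F:Sg|'`, it lists the two readings Nom:F:Sg and Acc:F:Sg on separate lines; for the empty set | it prints nothing.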