7 Adding XML annotations

To add XML annotations (e.g. <np> and <pp> tags inserted by a chunk parser) to an existing corpus, the usual strategy is to decode the token stream (and other attributes if necessary) to a temporary file, which then serves as input for the annotation tool. Since a chunk parser will often expect <s> and </s> tags marking sentence boundaries, these are decoded together with the tokens.
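The decoding step could be scripted along the following lines. This is only a minimal sketch: the corpus ID EXAMPLE, the attribute names word and s, and the output path are assumptions, and it relies on the compact output mode of cwb-decode (-C), which prints one token per line with XML tags on separate lines.

```python
import subprocess

def decode_argv(corpus, p_attrs=("word",), s_attrs=("s",)):
    """Build a cwb-decode command line for compact one-token-per-line output."""
    argv = ["cwb-decode", "-C", corpus]
    for att in p_attrs:          # positional attributes to print
        argv += ["-P", att]
    for att in s_attrs:          # structural attributes printed as XML tags
        argv += ["-S", att]
    return argv

def decode_to_file(corpus, path):
    """Run cwb-decode and store its output in a temporary file (needs CWB installed)."""
    with open(path, "wb") as out:
        subprocess.run(decode_argv(corpus), stdout=out, check=True)

# Shell equivalent (hypothetical file name):
#   cwb-decode -C EXAMPLE -P word -S s > /tmp/example.vrt
print(" ".join(decode_argv("EXAMPLE")))
```

Calling decode_to_file("EXAMPLE", "/tmp/example.vrt") would then produce the file that is passed on to the chunk parser.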

Now we can use cwb-encode to encode the XML annotations as structural attributes. The start and end points of regions are automatically computed from the token stream. Since we do not want to overwrite the word attribute, we specify -p -. With no p-attributes declared, all lines in the input file except for the XML tags will be ignored (apart from being counted as tokens). Recall that -0 s (digit zero) instructs cwb-encode to ignore <s> and </s> tags; without -0 s they would be interpreted as literal tokens and mess up the token stream.
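To make the region computation concrete, the following sketch mimics what happens for an invocation such as cwb-encode -d <datadir> -f chunks.vrt -p - -0 s -S np:0 -S pp:0 (data directory and file name are placeholders). It is an illustration of the idea rather than the actual cwb-encode implementation, the function name compute_regions is made up, and it assumes that regions of the same type are not nested.

```python
import re

# An XML tag on a line of its own: optional "/", element name, optional attributes.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w-]*)[^>]*>$")

def compute_regions(lines, s_attrs=("np", "pp"), null_attrs=("s",)):
    """Return {attribute: [(start, end), ...]} with inclusive corpus positions."""
    regions = {att: [] for att in s_attrs}
    open_start = {}   # attribute -> corpus position where its open region began
    cpos = 0          # corpus position of the next token
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = TAG.match(line)
        name = m.group(2) if m else None
        if name in null_attrs:
            continue                       # like -0 s: the tag is skipped entirely
        if name in regions:
            if m.group(1):                 # closing tag ends the open region
                if name in open_start:
                    regions[name].append((open_start.pop(name), cpos - 1))
            elif name not in open_start:   # opening tag (flat regions only)
                open_start[name] = cpos
            continue
        cpos += 1                          # any other line counts as one token
    return regions

chunks = """<s>
<np>
the
cat
</np>
sat
<pp>
on
<np>
the
mat
</np>
</pp>
</s>""".splitlines()

print(compute_regions(chunks))
# {'np': [(0, 1), (4, 5)], 'pp': [(3, 5)]}
```

If "s" were removed from null_attrs in this sketch, the <s> and </s> lines would be counted as tokens and every region would be shifted, which is the problem that -0 s avoids.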

In this example, cwb-encode will issue warnings about nested regions being dropped. As can be seen from Figure 5, <np> (as well as <pp>) regions may be embedded recursively. To preserve such nested regions, change the :0 modifier on the s-attribute declarations (e.g. -S np:0) to :2, which allows up to two levels of embedding (separately for each element type, i.e. <np> regions embedded in larger <np> regions, etc.). In general, :$n$ allows up to $n$ levels of embedding. The embedded regions will automatically be renamed to np1, np2, pp1, and pp2, respectively.
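The effect of the :$n$ modifier can be illustrated with a small simulation for a single element type. This is a sketch under simplifying assumptions (tags on separate lines without attributes, all other tags skipped without counting as tokens), not the cwb-encode source; the function name nested_regions and the sample input are invented for illustration.

```python
def nested_regions(lines, name="np", max_depth=2):
    """Assign regions of one element type to name, name1, ..., name<max_depth>.

    Returns (regions, dropped), where regions maps attribute names to lists of
    inclusive (start, end) corpus positions and dropped counts the regions
    embedded more deeply than max_depth.
    """
    attrs = [name] + [f"{name}{i}" for i in range(1, max_depth + 1)]
    regions = {att: [] for att in attrs}
    starts = []    # stack of start positions; None marks a dropped region
    dropped = 0
    cpos = 0       # corpus position of the next token
    for raw in lines:
        line = raw.strip()
        if line == f"<{name}>":
            if len(starts) > max_depth:    # too deeply embedded
                dropped += 1
                starts.append(None)
            else:
                starts.append(cpos)
        elif line == f"</{name}>":
            if starts:
                start = starts.pop()
                if start is not None:      # depth after pop selects np / np1 / np2
                    regions[attrs[len(starts)]].append((start, cpos - 1))
        elif line and not (line.startswith("<") and line.endswith(">")):
            cpos += 1                      # plain token line
    return regions, dropped

sample = """<np>
the
man
<pp>
in
<np>
the
house
<pp>
on
<np>
the
hill
</np>
</pp>
</np>
</pp>
</np>""".splitlines()

print(nested_regions(sample, max_depth=2))
print(nested_regions(sample, max_depth=0))
```

With max_depth=2 all three noun phrases are kept (as np, np1 and np2); with max_depth=0 only the outermost one survives and two regions are counted as dropped, corresponding to the warnings mentioned above.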