In order to add XML annotations (e.g. <np> and <pp> tags inserted by a chunk parser) to an existing corpus, the usual strategy is to decode the token stream (and other attributes if necessary) to a temporary file. A chunk parser will often expect <s> and </s> tags marking sentence boundaries.
Decode the word forms together with <s> regions:
$ cwb-decode -C VSS -P word -S s > word_s.vrt
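Assuming the output of the chunk parser has been saved to chunks.vrt, check that the token stream has not been altered by counting the non-tag lines, which should equal the size of the corpus:
$ grep -v '^<' chunks.vrt | wc -l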
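The line count only shows that the number of tokens is unchanged. A slightly stricter check compares the tokens themselves. The following is a minimal sketch under stated assumptions: the two strings are hypothetical sample data standing in for the contents of word_s.vrt and chunks.vrt, and token_lines is an illustrative helper, not part of CWB.

```python
def token_lines(vrt_text):
    """Return the non-tag lines of a verticalized text (like grep -v '^<')."""
    return [line for line in vrt_text.splitlines() if not line.startswith("<")]

# Hypothetical sample data standing in for word_s.vrt and chunks.vrt.
word_s = "<s>\nDie\nWiese\nist\nnass\n</s>\n"
chunks = '<s>\n<np head="Wiese">\nDie\nWiese\n</np>\nist\nnass\n</s>\n'

original = token_lines(word_s)
annotated = token_lines(chunks)

# The chunk parser may add tags, but must leave every token in place.
tokens_match = original == annotated
print(len(original), len(annotated), tokens_match)
```

If tokens_match is false, the start and end points computed for the new regions would no longer line up with the existing corpus positions.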
Now we can use cwb-encode to encode the XML annotations as structural attributes. The start and end points of regions are automatically computed from the token stream. Since we do not want to overwrite the word attribute, we specify -p -. With no p-attributes declared, all lines in the input file except for the XML tags will be ignored. Recall that -0 s (digit zero) instructs cwb-encode to ignore <s> and </s> tags (if left undeclared, they would be interpreted as literal tokens and mess up the token stream).
Encode <np> and <pp> regions in chunks.vrt as new s-attributes:
$ cwb-encode -d /corpora/data/vss -f chunks.vrt -p - -0 s -S np:0+head -S pp:0+head
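To illustrate how region boundaries follow from the token stream, here is a simplified Python model of the behaviour described above. It is a sketch, not the cwb-encode implementation: flat_regions, its parameters, and the sample lines are hypothetical, and the model assumes that every line that is not a declared or ignored tag occupies one corpus position.

```python
import re

START_TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>$')
END_TAG = re.compile(r"</(\w+)>$")
ATTR = re.compile(r'(\w+)="([^"]*)"')

def flat_regions(lines, declared, ignored):
    """Model of name:0 declarations: return (name, start, end, attrs) tuples."""
    cpos = 0            # corpus position of the next token
    depth = {}          # current nesting depth per declared element
    open_region = {}    # pending (start, attrs) of outermost regions
    regions = []
    for line in lines:
        start, end = START_TAG.match(line), END_TAG.match(line)
        name = (start or end).group(1) if (start or end) else None
        if name in ignored:
            continue    # like -0 s: the tag is skipped and uses no position
        if name in declared and start:
            depth[name] = depth.get(name, 0) + 1
            if depth[name] == 1:    # embedded regions are dropped at :0
                open_region[name] = (cpos, dict(ATTR.findall(start.group(2))))
        elif name in declared and end:
            if depth.get(name, 0) == 1:
                begin, attrs = open_region.pop(name)
                regions.append((name, begin, cpos - 1, attrs))
            depth[name] = max(depth.get(name, 0) - 1, 0)
        else:
            cpos += 1   # tokens and undeclared tags each take one position
    return regions

# Hypothetical sample input.
lines = ["<s>", '<np head="Wiese">', "Die", "Wiese", "</np>", "ist", "nass", "</s>"]
with_null = flat_regions(lines, {"np", "pp"}, {"s"})
without_null = flat_regions(lines, {"np", "pp"}, set())
print(with_null)
print(without_null)
```

In this model, leaving <s> undeclared shifts the <np> region by one position, because the tag is counted as a token. This is the kind of misalignment that -0 s is meant to prevent.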
In this example, cwb-encode will issue warnings about nested regions being dropped. As can be seen from Figure 5, <np> (as well as <pp>) regions may be embedded recursively. In order to preserve such nested regions, change the :0 modifier to :2, allowing up to two levels of embedding (separately for each element type, i.e. <np> regions embedded in larger <np> regions, etc.). In general, :n allows up to n levels of embedding. The embedded regions will automatically be renamed to np1, np2, pp1, and pp2, respectively.
Encode chunks.vrt again, allowing up to two levels of nested <np> and <pp> regions:
$ cwb-encode -d /corpora/data/vss -f chunks.vrt -p - -0 s -S np:2+head -S pp:2+head
$ cwb-regedit VSS :add :s np np1 np2 np_head np_head1 np_head2
$ cwb-regedit VSS :add :s pp pp1 pp2 pp_head pp_head1 pp_head2
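The renaming of embedded regions can be pictured with a small Python model. This is a sketch under assumptions, not the actual cwb-encode logic: nested_regions and the sample lines are hypothetical, only one element type is tracked at a time, and all other tags are simply skipped.

```python
def nested_regions(lines, element, max_depth):
    """Assign regions of one element type to element, element1, ... by depth."""
    cpos = 0        # corpus position of the next token
    stack = []      # start positions of the currently open regions
    regions = []
    dropped = 0     # regions embedded more deeply than max_depth
    for line in lines:
        if line == f"</{element}>":
            if stack:
                level = len(stack) - 1      # 0 = outermost region
                start = stack.pop()
                if level <= max_depth:
                    name = element if level == 0 else f"{element}{level}"
                    regions.append((name, start, cpos - 1))
                else:
                    dropped += 1
        elif line.startswith(f"<{element}>") or line.startswith(f"<{element} "):
            stack.append(cpos)
        elif not line.startswith("<"):
            cpos += 1                       # only token lines advance cpos
    return regions, dropped

# Hypothetical sample: an <np> containing a <pp> that contains another <np>.
lines = ["<np>", "die", "Wiese", "<pp>", "hinter", "<np>", "dem", "Haus",
         "</np>", "</pp>", "</np>"]
flat, flat_dropped = nested_regions(lines, "np", 0)
deep, deep_dropped = nested_regions(lines, "np", 2)
print(flat, flat_dropped)
print(deep, deep_dropped)
```

With max_depth set to 0 the inner region is counted as dropped, which corresponds to the warnings mentioned above. With max_depth set to 2 it is kept under the name np1, which is why the additional attribute names have to be declared in the registry.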
A start tag such as <np agr="|Nom:F:Sg|Acc:F:Sg|" head="Wiese"> might be encoded with the declaration -S np:2+agr/+head, where the slash / indicates that agr values are feature sets. Since head is not followed by a slash, the corresponding values are not treated as feature sets.
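As a rough illustration of the declaration syntax and the feature set notation, the following Python sketch splits a declaration into its parts and decodes attribute values accordingly. The function names are hypothetical, and the code only mirrors the notation shown above. It makes no claim about any validation or normalisation that cwb-encode itself may perform.

```python
def parse_declaration(spec):
    """Parse 'np:2+agr/+head' into (element, depth, {attribute: is_feature_set})."""
    head, *attrs = spec.split("+")
    element, _, depth = head.partition(":")
    flags = {a.rstrip("/"): a.endswith("/") for a in attrs}
    return element, (int(depth) if depth else None), flags

def parse_feature_set(value):
    """Split '|a|b|' into its members; a bare '|' is the empty set."""
    if not (value.startswith("|") and value.endswith("|")):
        raise ValueError(f"not in feature set notation: {value!r}")
    return [item for item in value.strip("|").split("|") if item]

element, depth, flags = parse_declaration("np:2+agr/+head")

# Attribute values from the example start tag above.
attrs = {"agr": "|Nom:F:Sg|Acc:F:Sg|", "head": "Wiese"}
decoded = {
    name: parse_feature_set(value) if flags[name] else value
    for name, value in attrs.items()
}
print(element, depth, flags)
print(decoded)
```

Here agr is split into its two members, whereas head is kept as a plain string because its declaration carries no slash.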