6 Adding attributes to an encoded corpus

To add positional attributes to a corpus that has already been encoded, create input data in the standard verticalized format, listing only the new attributes. Figure 4 shows an example of such an input file; it contains WordNet synonyms for the tokens from Figure 1 (without any attempt at word sense disambiguation). A corresponding list of synonyms for the complete VSS corpus can be found in the file syns.vrt.

**Figure 4:** WordNet synonyms for the text shown in Figure 1 (excerpt from file syns.vrt)
| |be|cost|live... .....| | |elephant| |

The special notation seen in Figure 4 indicates that the synonyms for any given word constitute an unordered set (or feature set in CWB terminology). Vertical bars (|) separate individual set elements and enclose the entire set; a single bar | denotes the empty set. Feature sets are stored as plain strings in a CWB-encoded corpus, but the special notation enables the query processor CQP to test whether a particular string is contained in the set, match all set elements against a regular expression, and compute the intersection of two sets.
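The benefit of the enclosing bars can be illustrated outside CQP. The following shell sketch tests set membership with a plain substring match on |element|; without the delimiters such a match would be unreliable, since bag must not match inside baggage. The helper name fs_contains is hypothetical and not part of CWB, and the sketch makes no claim about how CQP implements such tests internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of CWB): succeeds if element $2 is a
# member of the feature set string $1, given in strict |a|b| notation.
fs_contains() {
    case "$1" in
        *"|$2|"*) return 0 ;;   # element found between two bars
        *)        return 1 ;;
    esac
}

syn='|baggage|luggage|'
fs_contains "$syn" luggage && echo "luggage: member"
fs_contains "$syn" bag     || echo "bag: not a member"
fs_contains '|' luggage    || echo "empty set: no members"
```

Because every element is bracketed by bars on both sides, the empty set | can never match, and partial words are rejected without any further parsing.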

The file syns.vrt is encoded as usual, but the default word attribute has to be suppressed with the option -p -. It is highly recommended to check that the number of tokens in the new file (on Unix, run wc -l syns.vrt) equals the corpus size (as reported by cwb-lexdecode -S VSS), so that the new attribute is properly aligned with the rest of the corpus.

$ cwb-encode -d /corpora/data/vss -f syns.vrt -p - -P syn/

Notice the slash (/) appended to the attribute name syn. This notation indicates that the new attribute should be treated as a feature set; cwb-encode will automatically validate and normalise the supplied values, issuing warnings if they are not well-formed feature sets.¹¹
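The alignment check recommended before encoding is easy to script. The sketch below assumes that the input file contains exactly one token per line, with no XML tag lines or blank lines and a newline after the last token. check_alignment is a hypothetical helper; the expected size has to be copied by hand from the output of cwb-lexdecode -S, and the demo uses a three-line stand-in file rather than the real syns.vrt.

```shell
#!/bin/sh
# Hypothetical helper: compare the line count of a one-token-per-line
# file ($1) with the expected corpus size ($2).
check_alignment() {
    file=$1
    expected=$2
    # wc pads its output on some systems, so strip all whitespace
    actual=$(wc -l < "$file" | tr -d '[:space:]')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual tokens"
    else
        echo "MISMATCH: $file has $actual lines, corpus has $expected tokens" >&2
        return 1
    fi
}

# Demo on a stand-in file with three tokens
tmp=$(mktemp)
printf '|\n|baggage|luggage|\n|elephant|\n' > "$tmp"
check_alignment "$tmp" 3
rm -f "$tmp"
```

In practice the call would look like check_alignment syns.vrt N, with N taken from the cwb-lexdecode report for the corpus.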

As of CWB v3.4.28, cwb-encode is more lenient with the feature set format, also accepting input without the leading and trailing |, e.g. baggage|luggage and elephant (for a single-member set). An empty string¹² or single underscore (_) is interpreted as an empty set. This change was introduced to provide better support for CoNLL-style set notation (also used e.g. by TreeTagger lemmas), which can now be encoded without a pre-processing step. As a consequence, there will no longer be a warning if an attribute is mistakenly declared as a feature set (e.g. -P word/); the values will silently be transformed into single-item sets.
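The relation between the lenient and the strict notation can be sketched in a few lines of shell. The function to_strict_fs below is a hypothetical illustration, not the code used by cwb-encode: it only adds the enclosing bars and maps the empty string and a single underscore to the empty set, and it performs no validation or other normalisation of the set elements.

```shell
#!/bin/sh
# Hypothetical illustration (not cwb-encode's own code): rewrite one
# value per line from the lenient notation into the strict |a|b| form.
to_strict_fs() {
    awk '{
        v = $0
        if (v == "_") v = ""     # single underscore: empty set
        sub(/^\|/, "", v)        # drop an optional leading bar
        sub(/\|$/, "", v)        # drop an optional trailing bar
        print (v == "" ? "|" : "|" v "|")
    }'
}

printf '%s\n' 'baggage|luggage' 'elephant' '_' '' '|be|cost|live|' | to_strict_fs
# prints:
# |baggage|luggage|
# |elephant|
# |
# |
# |be|cost|live|
```

Values that are already in strict notation pass through unchanged, which is why a mistakenly declared attribute such as word/ no longer triggers a warning: every plain string is a valid single-member set under the lenient rules.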

The registry file for the corpus VSS now has to be edited to declare the new attribute. You will find it in the registry directory specified when the corpus was encoded or, if none was specified, in the default registry. Add the line

ATTRIBUTE syn

at the bottom of the file. If the CWB/Perl interface has been installed, the registry file can also be edited from the command line with the cwb-regedit registry editor script:

$ cwb-regedit VSS :add :p syn

This script can also be used to list and delete attributes, and to print basic information about a corpus (similar to cwb-describe-corpus, but easier for further processing). Type cwb-regedit -h for further information.
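Where the CWB/Perl interface is not available, the manual edit can also be scripted. The following sketch appends the declaration only if it is not already present, so that it can be run repeatedly without creating duplicate entries. declare_p_attribute is a hypothetical helper, and the registry file in the demo is an abbreviated stand-in; the location of the real file depends on your installation.

```shell
#!/bin/sh
# Hypothetical helper: append "ATTRIBUTE <name>" to registry file $1
# unless such a declaration already exists. Assumes that <name> is a
# plain identifier (letters, digits, underscore).
declare_p_attribute() {
    regfile=$1
    att=$2
    if grep -Eq "^ATTRIBUTE[[:space:]]+${att}[[:space:]]*\$" "$regfile"; then
        echo "attribute $att is already declared"
    else
        printf 'ATTRIBUTE %s\n' "$att" >> "$regfile"
    fi
}

# Demo on an abbreviated stand-in registry file
reg=$(mktemp)
printf 'ATTRIBUTE word\nSTRUCTURE s\n' > "$reg"
declare_p_attribute "$reg" syn
declare_p_attribute "$reg" syn    # second call leaves the file unchanged
cat "$reg"
rm -f "$reg"
```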

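Now you can build index files and compress the new attribute. If the CWB/Perl interface has been installed, a single command is sufficient:

$ cwb-make -V VSS

Otherwise, run the individual tools by hand:

$ cwb-makeall -V VSS syn
$ cwb-huffcode -P syn VSS
$ cwb-compress-rdx -P syn VSS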

To add structural attributes with computed start and end points (corpus positions), use the cwb-s-encode tool. The start and end positions of existing s-attributes can be obtained with cwb-s-decode. The following example adds information about sentence length to the VSS corpus.

The existing s attribute is decoded into a temporary file, then awk¹³ is used to compute sentence lengths, and the resulting annotated regions are encoded with cwb-s-encode.

$ cwb-s-decode VSS -S s > s.list
$ awk 'BEGIN { FS=OFS="\t" }  { print $1, $2, $2-$1+1 }' s.list > s_len.list
$ cwb-s-encode -d /corpora/data/vss -f s_len.list -V s_len
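The awk step can be tried out without a corpus. The sketch below wraps it in a hypothetical function region_lengths and feeds it three made-up regions, assuming the two-column format (start and end position, separated by a TAB) used for s.list above. Since both positions are inclusive, the length of a region is end - start + 1, so a region consisting of a single token has length 1.

```shell
#!/bin/sh
# Hypothetical wrapper around the awk command shown above: reads
# "start TAB end" lines and appends the region length as a third column.
region_lengths() {
    awk 'BEGIN { FS = OFS = "\t" } { print $1, $2, $2 - $1 + 1 }'
}

# Three made-up regions as a stand-in for the output of cwb-s-decode
printf '0\t3\n4\t9\n10\t10\n' | region_lengths
# prints (TAB-separated):
# 0   3   4
# 4   9   6
# 10  10  1
```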

Note that it is currently not necessary to run cwb-make after adding an s-attribute.

However, the new attribute still has to be declared in the registry file, either by manually adding

STRUCTURE s_len

or from the command line using the registry editor script:

$ cwb-regedit VSS :add :s s_len

Tables of corpus positions as input for cwb-s-encode can also be created from CQP query results using the dump or tabulate command in a CQP session.
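As a rough sketch of this approach, the following shell function turns a saved dump table into a list of regions. It assumes the four-column layout written by the dump command (match, matchend, target, keyword, separated by TABs) and input sorted by match position. Because the regions of an s-attribute must not overlap, a region that starts before the previous one has ended is skipped with a message. dump_to_regions is a hypothetical helper, and the input lines in the demo are made up.

```shell
#!/bin/sh
# Hypothetical helper: keep the match and matchend columns of a CQP dump
# table and skip regions that overlap the previously accepted region.
dump_to_regions() {
    awk 'BEGIN { FS = OFS = "\t"; last = -1 }
         $1 <= last {
             printf "skipping overlapping region %s-%s\n", $1, $2 > "/dev/stderr"
             next
         }
         { print $1, $2; last = $2 }'
}

# Made-up dump table: the second match overlaps the first one
printf '12\t15\t-1\t-1\n14\t20\t-1\t-1\n40\t41\t-1\t-1\n' | dump_to_regions
```

The resulting two-column table has the same shape as s.list above and could be passed to cwb-s-encode with the -f option, after adding an annotation column if the new attribute is to carry values.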