6 Adding attributes to an encoded corpus

In order to add positional attributes to a corpus that has already been encoded, create input data in the standard verticalized format, but listing only the new attributes. Figure 4 shows an example of such an input file, containing WordNet synonyms for the tokens from Figure 1 (without attempting any form of word sense disambiguation). A corresponding list of synonyms for the complete VSS corpus can be found in the file syns.vrt.

Figure 4: WordNet synonyms for the text shown in Figure 1 (excerpt from file syns.vrt)
\begin{figure}\begin{quote}
\begin{verbatim}
|
|be|cost|live...
.....|
|
|elephant|
|
\end{verbatim}
\end{quote}
\end{figure}

The special notation seen in Figure 4 indicates that the synonyms for any given word constitute an unordered set (or feature set in CWB terminology). Vertical bars (|) separate individual set elements and enclose the entire set; a single bar | denotes the empty set. Feature sets are stored as plain strings in a CWB-encoded corpus, but the special notation enables the query processor CQP to test whether a particular string is contained in the set, match all set elements against a regular expression, and compute the intersection of two sets.
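The notation itself is easy to produce and to read back with a small script. The following Python sketch is not part of CWB; the helper names `format_feature_set` and `parse_feature_set` are hypothetical. It assumes that set elements contain neither vertical bars nor whitespace, and it sorts the elements only to make its own output deterministic.

```python
import re


def format_feature_set(items):
    """Render an iterable of strings in feature-set notation, e.g. |a|b|.

    Hypothetical helper: duplicates are dropped and the elements are sorted
    so that the output is deterministic. An empty input yields a single bar.
    """
    unique = sorted(set(items))
    if not unique:
        return "|"
    return "|" + "|".join(unique) + "|"


def parse_feature_set(text):
    """Turn feature-set notation back into a Python set of strings."""
    if not (text.startswith("|") and text.endswith("|")):
        raise ValueError("not in feature-set notation: %r" % text)
    # Stripping the enclosing bars leaves the elements separated by single bars.
    return {item for item in text.strip("|").split("|") if item}


# The three kinds of test described in the text, emulated on Python sets.
verbs = parse_feature_set("|be|cost|live|")
nouns = parse_feature_set("|cost|price|")

contains_cost = "cost" in verbs                                   # membership
all_lowercase = all(re.fullmatch(r"[a-z]+", v) for v in verbs)    # regex on all elements
shared = verbs & nouns                                            # intersection
```

A file such as syns.vrt could then be written by emitting one `format_feature_set(...)` string per token, one token per line.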

In order to add structural attributes with computed start and end points (corpus positions), you can use the cwb-s-encode tool. The corresponding start and end positions of existing s-attributes can be obtained with cwb-s-decode. The following example adds information about sentence length to the VSS corpus.
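One way to compute such a table of sentence lengths is sketched below in Python. The sketch assumes that the start and end positions of the sentence regions are available as tab-separated lines (one region per line, as printed by cwb-s-decode for an s-attribute without annotations) and that the annotated regions are wanted as three tab-separated columns: start, end, value. The function name `sentence_length_table` is hypothetical, and the exact command-line options for the two CWB tools should be taken from their manual pages.

```python
def sentence_length_table(region_lines):
    """Append the region length in tokens to each 'start<TAB>end' line.

    Corpus positions are inclusive, so a region from start to end
    contains end - start + 1 tokens.
    """
    rows = []
    for line in region_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        start_text, end_text = line.split("\t")[:2]
        start, end = int(start_text), int(end_text)
        length = end - start + 1
        rows.append(f"{start}\t{end}\t{length}")
    return rows


if __name__ == "__main__":
    # Illustrative regions only; real input would come from the decoded corpus.
    for row in sentence_length_table(["0\t4\n", "5\t11\n"]):
        print(row)
```

The resulting lines can be saved to a file and passed to cwb-s-encode as the definition of a new s-attribute with annotated values.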

Tables of corpus positions as input for cwb-s-encode can also be created from CQP query results using the dump or tabulate command in a CQP session.
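As a rough illustration, the Python sketch below reduces the output of a dump command to a table of start and end positions. It assumes that each dump line holds whitespace-separated corpus positions with the match start and match end in the first two columns, and that the regions of an s-attribute should be sorted and non-overlapping. Dropping overlapping matches is a choice made for this sketch, not something CQP does, and the function name `dump_to_regions` is hypothetical.

```python
def dump_to_regions(dump_lines):
    """Convert dump-style lines into sorted, non-overlapping 'start<TAB>end' rows."""
    regions = []
    for line in dump_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        regions.append((int(fields[0]), int(fields[1])))

    regions.sort()

    kept = []
    last_end = -1
    for start, end in regions:
        # Keep a region only if it begins after the previously kept one ends.
        if start > last_end:
            kept.append((start, end))
            last_end = end
    return [f"{start}\t{end}" for start, end in kept]


if __name__ == "__main__":
    # Illustrative positions only; -1 stands for unset extra columns.
    example = ["10\t12\t-1\t-1", "0\t3\t-1\t-1", "2\t5\t-1\t-1"]
    for row in dump_to_regions(example):
        print(row)
```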