In order to add positional attributes to a corpus that has already been encoded, create input data in the standard verticalized format, but listing only the new attributes. Figure 4 shows an example of such an input file, containing WordNet synonyms for the tokens from Figure 1 (without attempting any form of word sense disambiguation). A corresponding list of synonyms for the complete VSS corpus can be found in the file syns.vrt.
The special notation seen in Figure 4 indicates that the
synonyms for any given word constitute an unordered set (or
feature set in CWB terminology). Vertical bars (|) separate
individual set elements and enclose the entire set; a single bar |
denotes the empty set. Feature sets are stored as plain strings in a
CWB-encoded corpus, but the special notation enables the query processor CQP
to test whether a particular string is contained in the set, match all set
elements against a regular expression, and compute the intersection of two
sets.
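As an illustration of the notation only, the following sketch builds a feature-set string from a list of items. The function name make_fset is hypothetical and not part of CWB. Sorting and removing duplicates is done here merely to make the output deterministic; it is not a claim about how CWB normalises sets internally.

```shell
# Hypothetical helper (not a CWB tool): print its arguments as a feature set,
# i.e. |item1|item2|...|, or a single bar | for the empty set.
make_fset() {
  if [ "$#" -eq 0 ]; then
    printf '|\n'          # no elements: the empty set
    return
  fi
  # One item per line, sorted in byte order with duplicates removed,
  # then joined with vertical bars that also enclose the whole set.
  printf '%s\n' "$@" | LC_ALL=C sort -u |
    awk 'BEGIN { printf "|" } { printf "%s|", $0 } END { printf "\n" }'
}

make_fset house home dwelling   # prints |dwelling|home|house|
make_fset                       # prints |
```

Items containing a vertical bar or whitespace would need extra handling, which this sketch does not attempt.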
$ cwb-encode -d /corpora/data/vss -f syns.vrt -p - -P syn/
Notice the slash (/) appended to the attribute name syn. This notation indicates that the new attribute should be treated as a feature set; cwb-encode will automatically validate and normalise the supplied values, issuing warnings if they are not well-formed feature sets.11
A single underscore (_) is interpreted as an
empty set. This change was introduced to provide better support for
CoNLL-style set notation (also used e.g. by TreeTagger lemmas), which can
now be encoded without a pre-processing step. As a consequence, there will
no longer be a warning if an attribute is mistakenly declared as a feature
set (e.g. -P word/); the values will silently be transformed into
single-item sets.
The new attribute has to be declared in the registry file by adding the line ATTRIBUTE syn at the bottom of the file. If the CWB/Perl interface has been installed, the registry file can also be edited from the command line with the cwb-regedit registry editor script:
$ cwb-regedit VSS :add :p syn
This script can also be used to list and delete attributes, and to print basic information about a corpus (similar to cwb-describe-corpus, but easier for further processing). Type cwb-regedit -h for further information.
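If cwb-regedit is not available, the manual edit of the registry file can be scripted. The following is a minimal sketch, assuming the registry entry is an ordinary text file to which a declaration line can be appended. The function name declare_attribute is hypothetical, and a temporary file stands in for the real registry file, whose location depends on the installation.

```shell
# Hypothetical helper (not a CWB tool): append a declaration line to a
# registry file unless exactly the same line is already present.
declare_attribute() {
  regfile=$1   # path to the registry file (installation-specific)
  decl=$2      # declaration line, e.g. "ATTRIBUTE syn"
  # -x: match the whole line, -F: fixed string, -q: no output
  grep -qxF "$decl" "$regfile" || printf '%s\n' "$decl" >> "$regfile"
}

regfile=$(mktemp)                             # stand-in for the real registry file
declare_attribute "$regfile" "ATTRIBUTE syn"
declare_attribute "$regfile" "ATTRIBUTE syn"  # second call changes nothing
cat "$regfile"                                # prints ATTRIBUTE syn (once)
```

Checking for an existing line first keeps the edit safe to repeat, for example when an encoding script is run a second time.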
$ cwb-make -V VSS
or
$ cwb-makeall -V VSS syn
$ cwb-huffcode -P syn VSS
$ cwb-compress-rdx -P syn VSS
In order to add structural attributes with computed start and end points (corpus positions), you can use the cwb-s-encode tool. The corresponding start and end positions of existing s-attributes can be obtained with cwb-s-decode. The following example adds information about sentence length to the VSS corpus.
$ cwb-s-decode VSS -S s > s.list
$ awk 'BEGIN { FS=OFS="\t" } { print $1, $2, $2-$1+1 }' s.list > s_len.list
$ cwb-s-encode -d /corpora/data/vss -f s_len.list -V s_len
Note that it is currently not necessary to run cwb-make after adding an s-attribute.
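To see what the awk step in this pipeline does, it can be run on a few made-up regions. The sketch below assumes tab-separated input with one region per line (start and end corpus position), as in s.list above. The function name add_region_length and the sample positions are illustrative only.

```shell
# Append the region length (end - start + 1 tokens) as a third column.
# Input and output are tab-separated, one region per line.
add_region_length() {
  awk 'BEGIN { FS=OFS="\t" } { print $1, $2, $2-$1+1 }'
}

# Two invented sentence regions: tokens 0..4 and tokens 5..11
printf '0\t4\n5\t11\n' | add_region_length
# prints (tab-separated):
# 0   4   5
# 5   11  7
```

The third column is the annotation value that cwb-s-encode stores for each region when the attribute is declared with -V.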
The new attribute has to be declared in the registry file, either by adding the line STRUCTURE s_len or from the command line using the registry editor script:
$ cwb-regedit VSS :add :s s_len
Tables of corpus positions as input for cwb-s-encode can also be created from CQP query results using the dump or tabulate command in a CQP session.
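As a rough sketch of that workflow, the following function turns a saved query dump into a list of regions. It assumes that the dump is tab-separated with the match start and match end in the first two columns, and that regions should be listed in ascending order of their start position. Both points should be checked against the CQP and cwb-s-encode documentation for the installed version. The function name dump_to_regions and the sample data are hypothetical.

```shell
# Hypothetical helper: keep the first two columns (start, end) of a
# tab-separated dump and sort the regions numerically by start position.
dump_to_regions() {
  cut -f1,2 | sort -n
}

# Invented dump lines with four columns; only the first two are kept.
printf '10\t12\t-1\t-1\n3\t5\t-1\t-1\n' | dump_to_regions
# prints (tab-separated):
# 3   5
# 10  12
```

The resulting table has the same two-column layout as the output of cwb-s-decode in the example above, so an annotation column can be appended in the same way before encoding.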