Recent versions of CWB have added extended options for the format of the input files.
As of CWB v3.4.37, .xz is a supported compression format in addition to .gz and .bz2 (as long as the relevant program, or 7-zip, is available), and compressed files are accepted for input and output by CQP and all CWB command-line tools.[7] Moreover, these versions can read from or write to shell pipes when given a quoted filename that starts with a pipe character (|).
As of CWB v3.4.27, an alternative input format can be activated with the -n option, which requires all token lines to be numbered in the first TAB-separated column (see Fig. 2). The numbering itself is ignored but helps to make an unambiguous distinction between XML tags and token lines. This is a useful and robust alternative to encoding metacharacters as XML entities (see section 5), which many other command-line tools do not process correctly. The options -n and -x can safely be combined.
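As a rough illustration of the numbered format, the following sketch builds a small input file from a plain token list. The file names tokens.txt and demo.vrt, the <text> region and the paths in the commented-out cwb-encode call are hypothetical; only the -n option itself is taken from the description above.

```shell
# Hypothetical token list, one token per line; note the literal "&" and "<".
printf '%s\n' 'AT&T' '<' 'IBM' > tokens.txt

# Wrap the tokens in a <text> region and prefix each token line with a
# running number and a TAB, as required by the -n input format.
{
  printf '<text id="demo">\n'
  awk -v OFS='\t' '{ print NR, $0 }' tokens.txt
  printf '</text>\n'
} > demo.vrt

cat demo.vrt

# A possible encoding call (paths are placeholders, adjust to your setup):
# cwb-encode -d data/demo -R registry/demo -n -f demo.vrt -S text:0+id
```

Because every token line starts with a number, the bare "<" token on the third line of demo.vrt cannot be mistaken for the start of an XML tag, and neither it nor the "&" needs to be written as an XML entity.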
As of CWB v3.4.28, the token numbers in the first column can be captured in an additional p-attribute (e.g. id) by using the -N option instead of -n (e.g. -N id). In either case, comment lines starting with a hash (#) are now ignored silently. The attribute declared with -N will always be the first p-attribute in the registry file, followed by the first attribute declared with -p or the default word attribute.
CWB v3.4.28 also introduces support for processing empty lines as sentence breaks with the -L option,[8] which implies -s. The segmentation is stored in a user-specified s-attribute, e.g. -L s. Note that this is a special “hidden” attribute, so explicit <s> and </s> tags will be treated as unknown elements. Combining the -N and -L options makes it possible to encode input files in one of the CoNLL formats[9] without additional pre-processing:
$ cwb-encode -N id -L s -f conll.vrt ...
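For readers unfamiliar with this layout, the sketch below writes a hypothetical two-sentence conll.vrt (three TAB-separated columns, chosen only for illustration) and runs a quick plausibility check on it. The awk script is not CWB's parser; it merely mimics the rules described above: comment lines are skipped, numbered lines count as tokens, and empty lines close a sentence.

```shell
# Hypothetical sample: comment lines, numbered token lines, and an empty
# line after each sentence.
printf '# sent_id = 1\n1\tDogs\tNOUN\n2\tbark\tVERB\n\n# sent_id = 2\n1\tCats\tNOUN\n2\tsleep\tVERB\n\n' > conll.vrt

# Rough sanity check before encoding (an approximation, not CWB's own logic).
summary=$(awk '
  /^#/ { next }                        # comment line: ignored
  /^$/ {                               # empty line: closes an open sentence
    if (in_sentence) { sentences++; in_sentence = 0 }
    next
  }
  /^[0-9]+\t/ { tokens++; in_sentence = 1 }   # numbered token line
  END {
    if (in_sentence) sentences++       # last sentence without trailing blank line
    print sentences + 0 " sentences, " tokens + 0 " tokens"
  }
' conll.vrt)
echo "$summary"
```

A check of this kind can help spot stray lines (for example unnumbered tokens) before the file is passed to cwb-encode with -N and -L.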
Several caveats apply: