3 Input format extensions

Recent versions of CWB have added extended options for the format of the input files.

CWB v3.4.37 adds .xz as a supported compression format alongside .gz and .bz2 (provided the relevant program, or 7-zip, is available). Compressed files are accepted for input and output by CQP and all CWB command-line tools.7 In these versions it is also possible to read from or write to shell pipes by specifying a quoted filename that starts with a pipe character (|).
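As a rough illustration of the two mechanisms, the following shell sketch compresses a tiny verticalized file and, only if cwb-encode is installed, reads it back once via its .gz extension and once through a pipe. The file names, the scratch directory and the data directories are hypothetical. The pipe spelling ("| command") follows the description above and is an assumption that should be checked against the installed CWB version.

```shell
# Scratch directory for the sample files (hypothetical location).
workdir=$(mktemp -d)

# A tiny verticalized sample: one token per line inside a sentence region.
printf '%s\n' '<s>' 'Hello' 'world' '</s>' > "$workdir/sample.vrt"
gzip -c "$workdir/sample.vrt" > "$workdir/sample.vrt.gz"

# Only attempt the encoding steps if cwb-encode is actually available.
if command -v cwb-encode >/dev/null 2>&1; then
  mkdir -p "$workdir/data-gz" "$workdir/data-pipe"
  # Compressed input, recognised by its file extension.
  cwb-encode -d "$workdir/data-gz" -f "$workdir/sample.vrt.gz" -S s
  # Assumed pipe syntax: a quoted file name starting with | is run as a
  # shell command and its output is read as the input file.
  cwb-encode -d "$workdir/data-pipe" -f "| gzip -cd $workdir/sample.vrt.gz" -S s
fi
```

The two cwb-encode calls are meant to produce the same data files; the pipe form is mainly of interest when the input needs some pre-processing on the fly.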

As of CWB v3.4.27, an alternative input format can be activated with the -n option, which requires all token lines to be numbered in the first TAB-separated column (see Fig. 2). The numbering itself is ignored but helps to make an unambiguous distinction between XML tags and token lines. This is a useful and robust alternative to encoding metacharacters as XML entities (see section 5), which many other command-line tools do not process correctly. The options -n and -x can safely be combined.

Figure 2: Verticalized text file example-numbered.vrt in -n format
    <s>
    1 The DT the
    2 tag NN tag
    3 <s...
    ...Z be
    5 useful JJ useful
    6 ! SENT !
    </s>
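Numbered input of this kind is easy to generate at the point where tokens are still known to be tokens. The sketch below is only an illustration under stated assumptions: the function name to_numbered_vrt is hypothetical, the input is assumed to hold one whitespace-tokenised sentence per line, and only the word form column is written (further annotation columns would follow after additional TABs).

```shell
# Convert one-sentence-per-line, whitespace-tokenised text into numbered
# vertical format: structural tags stay unnumbered, every token line gets
# a running number in the first TAB-separated column.
to_numbered_vrt() {
  awk '{
    print "<s>"
    for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i
    print "</s>"
  }'
}

# The token "<s>" in the middle of the sentence ends up on a numbered line,
# so it cannot be confused with the unnumbered sentence tags around it.
printf '%s\n' 'The tag <s> is useful !' | to_numbered_vrt
```

Because the distinction is made by the presence of a number rather than by the line content, no escaping of < or & is needed in the token columns.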

As of CWB v3.4.28, the token numbers in the first column can be captured in an additional p-attribute (e.g. id) by using the -N option instead of -n (e.g. -N id). In either case, comment lines starting with a hash (#) are now ignored silently. The attribute declared with -N will always be the first p-attribute in the registry file, followed by the first attribute declared with -p or the default word attribute.
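The line classification described above can be emulated with a few lines of awk, which is sometimes handy for checking an input file before encoding. This is a rough sketch that mirrors the description in the text, not the actual cwb-encode logic. The function name classify_numbered_lines is hypothetical, and the sketch assumes that token numbers are plain decimal integers.

```shell
# Report how each line of a numbered input file would be interpreted.
# Emulation for illustration only (assumption: numbers are plain integers).
classify_numbered_lines() {
  awk -F '\t' '
    /^#/                      { next }                      # comment line: skipped silently
    NF > 1 && $1 ~ /^[0-9]+$/ { print "token\t" $2; next }  # numbered line: a token
    /^</                      { print "tag\t" $0; next }    # unnumbered <...>: an XML tag
                              { print "other\t" $0 }        # anything else: worth inspecting
  '
}

# A comment, an opening tag, a token that looks like a tag, and a closing tag.
printf '# sample comment\n<s>\n1\t<s>\n</s>\n' | classify_numbered_lines
```

Lines reported as "other" are the ones to look at before running the encoder, since they are neither numbered tokens nor recognisable tags or comments.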

CWB v3.4.28 also introduces support for processing empty lines as sentence breaks with the -L option,8 which implies -s. The segmentation is stored in a user-specified s-attribute, e.g. -L s. Note that this is a special “hidden” attribute, so explicit <s> and </s> tags will be treated as unknown elements. Combining -N and -L makes it possible to encode input files in one of the CoNLL formats9 without additional pre-processing:

$ cwb-encode -N id -L s -f conll.vrt ...
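To make the expected input shape concrete, the following sketch writes a small CoNLL-style sample and counts the sentence blocks and token lines it contains. The sample is deliberately simplified (three TAB-separated columns: number, word form, lemma) and the temporary file is hypothetical; real CoNLL files have more columns. The counts are computed with standard tools and only approximate what the encoder is described as doing with empty lines and comment lines.

```shell
# Write a simplified CoNLL-style sample: comment lines, numbered token
# lines, and an empty line between sentences.
sample=$(mktemp)
{
  printf '# sent_id = 1\n'
  printf '1\tHello\thello\n'
  printf '2\tworld\tworld\n'
  printf '\n'
  printf '# sent_id = 2\n'
  printf '1\tBye\tbye\n'
} > "$sample"

# awk paragraph mode (RS = "") splits records at empty lines, so each
# record corresponds to one sentence block.
sentences=$(awk 'BEGIN { RS = "" } END { print NR }' "$sample")

# Token lines are those that are neither comments nor empty.
tokens=$(grep -c -v -e '^#' -e '^$' "$sample")

echo "sentences=$sentences tokens=$tokens"
```

Comparing such counts with the sizes reported for the encoded corpus is a simple sanity check that sentence breaks and comment lines were handled as intended.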
Several caveats apply: