3 Input format extensions

Recent versions of CWB support several extensions to the format of the input files.

As of CWB v3.4.37, .xz is a supported compression format in addition to .gz and .bz2 (as long as the relevant program, or 7-zip, is available), and compressed files are accepted for input and output by CQP and all CWB command-line tools.⁷ These versions can also read from or write to shell pipes if a quoted filename starting with a pipe character (|) is specified.

As of CWB v3.4.27, an alternative input format can be activated with the -n option, which requires all token lines to be numbered in the first TAB-separated column (see Fig. 2). The numbering itself is ignored but helps to make an unambiguous distinction between XML tags and token lines. This is a useful and robust alternative to encoding metacharacters as XML entities (see section 5), which many other command-line tools do not process correctly. The options -n and -x can safely be combined.

**Figure 2:** Verticalized text file example-numbered.vrt in -n format
    <s>
    1	The	DT	the
    2	tag	NN	tag
    3	<s...
    ...Z	be
    5	useful	JJ	useful
    6	!	SENT	!
    </s>
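The following Python sketch shows one way such a numbered file could be generated from tokenised data. It is only an illustration: the function name `to_numbered_vrt`, the tuple layout of the tokens, the restart of the numbering in every sentence and the SYM tag in the usage example are assumptions and not part of CWB. Because the numbers are ignored by the encoder in -n format, any numbering scheme should serve equally well.

```python
from typing import Iterable, Sequence

# One token = its annotation columns, e.g. ("The", "DT", "the").
Token = Sequence[str]


def to_numbered_vrt(sentences: Iterable[Iterable[Token]], tag: str = "s") -> str:
    """Render sentences as numbered vertical text, one token per line."""
    lines = []
    for sentence in sentences:
        lines.append(f"<{tag}>")
        for number, token in enumerate(sentence, start=1):
            for field in token:
                # A TAB or line break inside a field would corrupt the columns.
                if "\t" in field or "\n" in field:
                    raise ValueError(f"field contains TAB or newline: {field!r}")
            # The number fills the first TAB-separated column, so a token
            # such as "<s>" no longer looks like an XML tag line.
            lines.append("\t".join([str(number), *token]))
        lines.append(f"</{tag}>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical data; the third token would be ambiguous without numbering.
    example = [[("The", "DT", "the"), ("tag", "NN", "tag"), ("<s>", "SYM", "<s>")]]
    print(to_numbered_vrt(example), end="")
```

Lines that start with a number are token lines and everything else is markup, so no entity encoding of the token `<s>` is needed in this sketch.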

As of CWB v3.4.28, the token numbers in the first column can be captured in an additional p-attribute (e.g. id) by using the -N option instead of -n (e.g. -N id). In either case, comment lines starting with a hash (#) are now ignored silently. The attribute declared with -N will always be the first p-attribute in the registry file, followed by the first attribute declared with -p or the default word attribute.

CWB v3.4.28 also introduces support for processing empty lines as sentence breaks with the -L option,⁸ which implies -s. The segmentation is stored in a user-specified s-attribute, e.g. -L s. Note that this is a special “hidden” attribute, so explicit <s> and </s> tags will be treated as unknown elements. The combination of both options makes it possible to encode input files in one of the CoNLL formats⁹ without additional pre-processing:

$ cwb-encode -N id -L s -f conll.vrt ...

Several caveats apply:

- CWB does not recognise any specific CoNLL flavour, i.e. all columns (except for the token numbers in the first column) have to be declared explicitly as p-attributes.
- Multiword tokens (labelled with a number range, e.g. 3-4) and empty tokens (labelled e.g. as 5.1) are silently discarded.
- All comment lines are discarded, even the special notation used for text structure boundaries and metadata in CoNLL-U (which is entirely misguided, of course). Such metadata comments can easily be converted into XML tags in a pre-processing step and encoded as described in Sec. 5.
- All annotation columns are encoded as-is into positional attributes. Dependency relations or phrases in bracketing notation are not transformed into graph or tree structures (which are not supported by CWB 3); chunks in IOB notation are not expanded into a structural attribute.
- CoNLL feature set notation can be transformed into CWB syntax, but this has to be requested explicitly on the command line for each attribute, e.g. -P morph/ (see further Sec. 6). Sets will also be re-sorted in alphabetical order with this option.
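The pre-processing step mentioned for metadata comments could look roughly like the Python sketch below. It is a minimal illustration under several assumptions: only the CoNLL-U comments `# newdoc id = ...` and `# sent_id = ...` are handled, they are mapped to hypothetical `<text id="...">` and `<s id="...">` regions, and all other comments are dropped. Because the sketch writes explicit `<s>` tags, its output is meant for the numbered input format without -L s. Multiword and empty tokens are filtered out here as well, which simply mirrors the behaviour described above.

```python
import re
from typing import Iterable, Iterator
from xml.sax.saxutils import escape

NEWDOC = re.compile(r"#\s*newdoc\b(?:\s+id\s*=\s*(.+))?")
SENT_ID = re.compile(r"#\s*sent_id\s*=\s*(.+)")
PLAIN_ID = re.compile(r"[0-9]+")  # excludes ranges (3-4) and empty nodes (5.1)


def open_tag(name: str, ident: "str | None") -> str:
    """Build a start tag, escaping the id value for use in double quotes."""
    if ident is None:
        return f"<{name}>"
    return f'<{name} id="{escape(ident, {chr(34): "&quot;"})}">'


def conllu_to_vrt(lines: Iterable[str]) -> Iterator[str]:
    """Turn CoNLL-U metadata comments and empty lines into XML tags."""
    in_doc = in_sent = False
    sent_id = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():  # an empty line ends the current sentence
            if in_sent:
                yield "</s>"
                in_sent = False
            sent_id = None
        elif line.startswith("#"):
            doc = NEWDOC.match(line)
            sent = SENT_ID.match(line)
            if doc:
                if in_sent:
                    yield "</s>"
                    in_sent = False
                if in_doc:
                    yield "</text>"
                yield open_tag("text", doc.group(1).strip() if doc.group(1) else None)
                in_doc = True
            elif sent:
                sent_id = sent.group(1).strip()
            # any other comment line is dropped
        elif PLAIN_ID.fullmatch(line.split("\t", 1)[0]):
            if not in_sent:
                yield open_tag("s", sent_id)
                in_sent = True
            yield line
    if in_sent:
        yield "</s>"
    if in_doc:
        yield "</text>"
```

Typical use would be to write the generated lines to a new file, e.g. `out.writelines(l + "\n" for l in conllu_to_vrt(src))` with `src` and `out` as open file handles, and then encode that file with the corresponding s-attributes declared as described in Sec. 5.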