8 Decoding and analysing corpora

The cwb-lexdecode tool provides access to the lexicon of positional attributes, i.e. lists of all word forms or annotation strings (types) with their corpus frequencies. The -S option prints only the sizes of the corpus (tokens) and of the lexicon (types), -P selects the desired p-attribute, -f shows corpus frequencies, and -s lists the lexicon entries alphabetically (according to the internal sort order). In order to sort the lexicon by frequency, an external program (e.g. sort) has to be used.

$ cwb-lexdecode -S    -P lemma VSS
$ cwb-lexdecode -f -s -P lemma VSS | tail -20
$ cwb-lexdecode -f    -P lemma VSS | sort -nr -k 1 | head -20
It is also possible to annotate strings from a file (called tags.txt here) with corpus frequencies. The file must be in one-word-per-line format. -0 (digit zero) prints a frequency of 0 for unknown strings rather than issuing a warning message; it can be combined with -f into the mnemonic form -f0.
$ cwb-lexdecode -f0 -P pos -F tags.txt VSS
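For reference, tags.txt simply contains one annotation string per line. A hypothetical example is shown below; the tag XYZ is assumed not to occur in the corpus, so it will be listed with frequency 0:

NN
JJ
SENT
XYZ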
With the -p option, word forms or annotations matching a regular expression can be extracted. Case-insensitive and accent-insensitive matching is selected with -c and -d, respectively. The example below is similar to the CQP query [lemma = "over.+" %c], but may be considerably faster on a large corpus.
$ cwb-lexdecode -f -P lemma -p "over.+" -c VSS

An entire corpus or selected attributes from a corpus can be printed in various formats with the cwb-decode tool. Note that options and switches must appear before the corpus name, and the flags used to select attributes after the corpus name. Use -P to select p-attributes and -S for s-attributes. With the -s and -e options, a part of the corpus (identified by start and end corpus position) can be printed.

$ cwb-decode -C -s 7299 -e 7303  VSS  -P word -P pos -S s
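In this output format, each token occupies one line, with the selected p-attributes in tab-separated columns, and the boundaries of s-attribute regions appear as XML-style tags on lines of their own. Schematically, the output looks as follows (the tokens shown are placeholders for illustration, not the actual VSS text at these positions):

<s>
A       DT
very    RB
short   JJ
story   NN
.       SENT
</s>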
-C refers to the compact one-word-per-line format expected by cwb-encode. For a full textual copy of a CWB corpus, use -ALL to select all positional and structural attributes.
$ cwb-decode -C  VSS  -ALL  > vss-corpus.vrt
The resulting file vss-corpus.vrt can be re-encoded with cwb-encode (using appropriate flags) to give an exact copy of the VSS corpus. -Cx is almost identical to the compact format, but changes some details in order to generate a well-formed XML document (unless there are overlapping regions or s-attributes with “simple” annotations).15
$ cwb-decode -Cx  VSS  -ALL  > vss-corpus.xml
This output format can reliably be re-encoded if the -xsB options are used (see section 5).
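A corresponding re-encoding command might look like the following sketch; the registry and data directory paths are placeholders, and only part of the attribute declarations is shown (see Sec. 5 for the full set of declarations used when the VSS corpus was originally encoded):

$ cwb-encode -xsB -d /corpora/data/vss_copy -R /corpora/registry/vss_copy \
    -f vss-corpus.xml -P pos -P lemma -S s ...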

As of CWB v3.4.33, the opposite round-trip is also supported, i.e. it is possible to reconstruct a .vrt input file almost exactly. To this end, nested XML regions and attribute-value pairs in start tags, which have been broken up into separate s-attributes by cwb-encode as described in Sec. 5, need to be recombined by giving corresponding -S specifications to cwb-decode.
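For instance, if the stories of the VSS corpus were originally declared to cwb-encode as -S story:0+title (so that the s-attributes story and story_title were created), passing the same specification to cwb-decode recombines them into <story title="..."> start tags. This is only a sketch; the specifications must match the declarations actually used for your corpus:

$ cwb-decode -C VSS -P word -P pos -P lemma -S story:0+title -S s > vss-reconstructed.vrt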

Finally, -X produces a native XML output format (following a fixed DTD), which can be post-processed and formatted with XSLT stylesheets.

$ cwb-decode -X -s 7299 -e 7303  VSS  -P word -P pos -S s -S np_head
Note that the regions of s-attributes are not translated into XML regions. Instead, the start and end tags are represented by special empty <tag> elements.

As of CWB v3.4.28, the cwb-encode and cwb-decode utilities provide improved support for reading and writing CoNLL-style formats; see section 2 for details and limitations. Section 3 covers how to index CoNLL files. Such a corpus can easily be decoded back into CoNLL format, with the option -b s adding a blank line after each sentence region:

$ cwb-decode -C -b s CONLL_CORPUS -P id -P word -P pos ...
If token numbers haven't been indexed explicitly, use numbered output mode (-Cn) to insert corpus positions as placeholders in the first output column:
$ cwb-decode -Cn CONLL_CORPUS -P word -P pos ...
An alternative strategy is to extract all sentence regions and decode them in “matchlist mode”, which automatically adds blank lines as delimiters. In this approach, comment lines with metadata information can be added at the start of each sentence using -V flags:
$ cwb-s-decode CONLL_CORPUS -S s | 
  cwb-decode -Cn -p CONLL_CORPUS -P word -P pos ... -V text_id -V s_num
It is then also possible to decode only a subset of the sentences by running a suitable CQP query (with expand to s) and dumping the corresponding corpus positions. See man cwb-decode for examples.
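For illustration, such a round trip might look like the following sketch; the query, the temporary file name and the use of cut to extract the match/matchend columns from the dump are assumptions here, and the man page provides the authoritative examples:

> CONLL_CORPUS;
> A = [word = "however" %c] expand to s;
> dump A > "matches.tbl";

$ cut -f 1-2 matches.tbl |
  cwb-decode -Cn -p CONLL_CORPUS -P word -P pos -V text_id -V s_num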

cwb-scan-corpus computes combinatorial frequency tables for an encoded corpus. It is similar to the group command in CQP, but offers a faster and more memory-efficient alternative for extracting simple structures from large corpora and is not restricted to singletons and pairs. The output of cwb-scan-corpus is an unordered list of n-tuples and their frequencies, which has to be post-processed and sorted with external tools. The simple example below prints the twenty most frequent (lemma, pos) pairs in the VSS corpus, using the -C option to filter punctuation and noise from the list of lemmata (note that -C applies to all selected attributes).17

$ cwb-scan-corpus -C VSS lemma pos | sort -nr -k 1 | head -20
A non-negative offset can be added to each field key in order to collect bigrams, trigrams, etc. The following example derives a simple language model in the form of all sequences of three consecutive part-of-speech tags together with their occurrence counts. Only the twenty most frequent sequences are displayed.
$ cwb-scan-corpus VSS pos+0 pos+1 pos+2 | sort -nr -k 1 | head -20
For a large corpus such as the BNC, the scan results can directly be written to a file with the -o switch. If the filename ends in .gz, .bz2 or .xz (such as the file language-model.gz in the example below), the output file is automatically compressed (subject to the caveats discussed in Sec. 2).
$ cwb-scan-corpus -o language-model.gz BNC pos+0 pos+1 pos+2
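The compressed file can then be inspected or post-processed without decompressing it to disk, e.g.:

$ gzip -dc language-model.gz | sort -nr -k 1 | head -20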
The values of the selected p-attributes can also be filtered with regular expressions. The following command identifies part-of-speech sequences at the end of sentences (indicated by the tag SENT = sentence-ending punctuation).
$ cwb-scan-corpus VSS pos+0 pos+1 pos+2=/SENT/ | sort -nr -k 1 | head -20
Since the third key is used only for filtering, we can suppress it in the output by marking it as a constraint key with the ? character.
$ cwb-scan-corpus VSS pos+0 pos+1 ?pos+2=/SENT/ | sort -nr -k 1 | head -20

cwb-scan-corpus can operate both on p-attributes and on s-attributes with annotated values. For instance, to obtain by-story frequency lists for the VSS corpus, use the following command:

$ cwb-scan-corpus -o freq-by-story.tbl VSS lemma+0 story_title+0
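Assuming the usual output layout (the joint frequency in the first column, followed by the values of the selected keys in tab-separated columns), the table can be re-sorted with standard tools, e.g. to browse it story by story with the most frequent lemmata first ($'\t' is Bash syntax for a literal tab character):

$ sort -t $'\t' -k 3,3 -k 1,1nr freq-by-story.tbl | less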
As a special case, s-attributes without annotated values can be used to restrict the corpus scan to regions of a particular type. For instance, the constraint key ?footnote would only scan <footnote> regions. Keep in mind that such special constraints must not include a regular expression part.
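On a corpus that does contain such regions (MY_CORPUS is a placeholder name for a corpus with a <footnote> s-attribute), the scan might look like this:

$ cwb-scan-corpus -C MY_CORPUS lemma '?footnote' | sort -nr -k 1 | head -20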

The final example extracts pairs of adjacent adjectives and nouns from the VSS corpus, e.g. as candidate data for adjective-noun collocations. Constraint keys are used to identify adjectives and nouns, and only nouns starting with a vowel are accepted here. Note the c and d modifiers (case- and diacritic-insensitive matching) on this regular expression. It is recommended to put all keys with non-trivial constraints in single quotes in order to avoid misinterpretation of shell metacharacters.

$ cwb-scan-corpus -C VSS lemma+0 '?pos+0=/JJ.*/' \
                         'lemma+1=/[aeiou].+/cd' '?pos+1=/NN.*/'
Except for the -C option, this command line is equivalent to the following CQP commands, but it will execute much faster on a large corpus.
> A = [pos = "JJ.*"] [pos = "NN.*" & lemma = "[aeiou].+" %cd];
> group A matchend lemma by match lemma;

The cwb-scan-corpus command is limited to relatively simple constraints on tokens, and it can only match patterns with fixed offsets (but not e.g. determiner and noun separated by an arbitrary number of adjectives). To obtain frequency tables for more complex patterns, use CQP queries in combination with the tabulate function. The resulting data tables can be saved to disk and loaded into a relational database or processed with some other software package for statistical analysis.
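As a sketch of this approach, the determiner-adjective-noun pattern mentioned above could be handled as follows; the query and output file name are illustrative, and the tagset is assumed to be the Penn-style one used in the examples above:

> A = [pos = "DT"] [pos = "JJ.*"]* [pos = "NN.*"];
> tabulate A match lemma, matchend lemma > "det-noun-pairs.tbl";

The resulting table can then be turned into a frequency list with standard tools (e.g. sort | uniq -c) or imported into a database.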

As of CWB v3.4.26, the -w option restricts n-grams to those fully contained in a single region of the specified s-attribute, similar to the within constraint of a CQP query. The following command lists POS trigrams inside noun phrases, sorted by frequency; the -f 10 option suppresses trigrams that occur fewer than ten times:

$ cwb-scan-corpus -f 10 -w np VSS pos+0 pos+1 pos+2 | sort -nr
(Note that a hidden constraint key is added so that the scan will skip efficiently from the end of one region to the start of the next.)

Only a single -w constraint can be specified, but normal existence constraints can be used to restrict the scan further, e.g. to NPs within a PP:

$ cwb-scan-corpus -f 10 -w np VSS pos+0 pos+1 pos+2 ?pp+0 | sort -nr
This will only work correctly if the -w regions are fully contained in the regions tested with the existence constraints.

It is also possible to compute document frequencies based on an arbitrary s-attribute, using the -d option. This command will list all lemmas that occur in all six stories of the VSS collection:

$ cwb-scan-corpus -f 6 -d story VSS lemma+0
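The threshold can of course be relaxed; for instance, lemmata attested in at least three of the six stories are listed with:

$ cwb-scan-corpus -f 3 -d story VSS lemma+0 | sort -nr -k 1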
The -d option automatically enforces a corresponding -w constraint. It cannot be combined with an explicit -w option (i.e. -d story -w story is invalid), nor with the -F option for summing over pre-computed frequency counts.