- for many applications it is important to compute frequency tables for
the matching strings, tokens in the immediate context, attribute values at
different anchor points, different attributes for the same anchor, or
various combinations thereof
- frequency tables for the matching strings, optionally normalised to
lowercase and extended or reduced by an offset, can easily be computed with
the count command (cf. Sections 2.9 and
3.3); when pretty-printing is deactivated (cf. Section 7.1), its output has the form
frequency TAB first line TAB string (type)
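- for illustration, a frequency table of the lowercased match strings of a
named query result A might be computed as follows (a sketch; pretty-printing
can be switched off beforehand, e.g. with set PrettyPrint off; cf. Section 7.1)
> count A by word %c;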
- advantages of the count command:
- strings of arbitrary length can be counted
- frequency counts can be based on normalised strings (%c and %d flags)
- the instances (tokens) for a given string type can easily be identified:
since the underlying query result is automatically sorted by the count
command, these instances appear as a contiguous block starting at the match
number given by first line
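- for example, if the count output lists a type with frequency 16 starting
at match number 120 (hypothetical values), the corresponding instances can
be displayed as a block with
> cat A 120 135;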
- an alternative solution is the group command (see
Sec. 3.4), which computes frequency distributions over
single tokens (i.e. attribute values at a given anchor position) or pairs
of tokens (recall the counter-intuitive command syntax for this case); when
pretty-printing is deactivated, its output has the form
attribute value TAB attribute value TAB frequency
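- for illustration, a joint frequency table of the lemmas at the match and
matchend anchors might be computed along these lines (a sketch, assuming a
lemma attribute; recall the counter-intuitive order of the two keys)
> group A matchend lemma by match lemma;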
- advantages of the group command:
- can compute joint frequencies for non-adjacent tokens
- faster when there are relatively few different types to be counted
- supports frequency distributions for the values of s-attributes
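- for instance, assuming the DICKENS corpus used further below, with
novel_title as an s-attribute with annotated values, the distribution of
matches across novels might be obtained with
> group A match novel_title;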
- the advantages of group and count are for the most part complementary
(e.g. it is not possible to normalise the values of s-attributes, or to
compute joint frequencies of two non-adjacent multi-token strings); in
addition, they have some common weaknesses, such as relatively slow
execution, no options for filtering and pooling data, and limitations on the
types of frequency distributions that can be computed (only simple joint
frequencies, no nested groupings)
- new in CQP v3.4.9: the group command has been re-implemented with a
hash-based algorithm and is now very fast even for large frequency tables;
the other limitations still apply, though
- therefore, it is often necessary (and usually more efficient) to
generate frequency tables with external programs such as dedicated software
for statistical computing or a relational database; these tools need a
data table as input, which lists the relevant feature values (at
specified anchor positions) and/or multi-token strings for each match in the
query result; such tables can often be created from the output of
cat (using suitable PrintOptions, Context and
show settings)
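- such a cat-based data table might be prepared roughly as follows (an
illustrative sketch; the appropriate settings depend on the desired columns)
> set Context 1 s;
> show +lemma;
> cat A > "matches.txt";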
- this procedure involves a considerable amount of re-formatting (e.g. with Unix command-line tools or Perl scripts) and can easily break when
there are unusual attribute values in the data; both cat output and
the re-formatting operations are expensive, making this solution inefficient
when there is a large number of matches
- in most situations, the tabulate command provides a more
convenient, more robust and faster solution; the general form is
> tabulate A column spec, column spec, ... ;
this will print a TAB-separated table where each row corresponds to
one match of the query result A and the columns are described by
one or more column spec(ification)s
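- a minimal concrete example (a sketch, assuming positional attributes word
and pos) would be
> tabulate A match, match word, match pos;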
- just as with dump and cat, the table can be restricted
to a contiguous range of matches, and the output can be redirected to a file
or pipe
> tabulate A 100 119 column spec, column spec, ... ;
> tabulate A column spec, column spec, ... > "data.tbl";
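- for instance, the first 20 matches might be written to a data file with
(an illustrative sketch, assuming a word attribute)
> tabulate A 0 19 match word, matchend word > "sample.tbl";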
- each column specification consists of a single anchor (with optional
offset) or a range between two anchors, using the same syntax as
sort and count; without an attribute name, this
will print the corpus positions for the selected anchor, so
> tabulate A match, matchend, target, keyword;
produces exactly the same output as dump A, provided that target and keyword
anchors are defined for the query result A; otherwise, it will print an
error message (and you need to leave out the column specs target and/or
keyword)
- when an attribute name is given after the anchor, the values of this
attribute for the selected anchor point will be printed; both positional
and structural attributes with annotated values can be used; the following
example prints a table of novel title, book number and chapter title for a
query result from the DICKENS corpus
> tabulate A match novel_title, match book_num, match chapter_title;
note that undefined values (for the book_num and chapter_title attributes)
are represented in the tabulation output by the empty string
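- anchors may also carry offsets in combination with an attribute name, e.g.
to print the word form of the token immediately preceding each match,
together with its corpus position (a sketch)
> tabulate A match[-1], match[-1] word;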
- if an anchor point is undefined or falls outside the corpus (because of
an offset), tabulate prints an empty string or the corpus position
-1 (correct behaviour implemented in v3.4.10)
- a range between two anchor points prints the values of the selected
attribute for all tokens in the specified range; usually, this only makes
sense for positional attributes; the following example prints the
lemma values for 5 tokens before and after each match;
this data can be used to identify collocates of the items matched by the query
> tabulate A match[-5]..match[-1] lemma, matchend[1]..matchend[5] lemma;
the attribute values for tokens within each range are separated by
blanks rather than TABs, in order to avoid ambiguities in the
resulting data table
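- such a table can then be post-processed outside CQP; for instance, a rough
frequency list of right-neighbour lemmas (a first step towards collocate
identification) might be obtained with a shell pipeline like the one shown
further below
> tabulate A matchend[1] lemma > "| sort | uniq -c | sort -nr";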
- any items in an anchor-point range that fall outside the bounds of the
corpus are printed as empty strings or corpus positions -1;
if either the start or end of the range is an undefined anchor, a single
empty string or cpos -1 is printed for the entire range
(correct behaviour implemented in v3.4.10)
- the end position of a range must not be smaller than its start position,
so take care to order items properly and specify sensible offsets; in
particular, a range specification such as match .. target must not be used
if the target anchor might be to the left of the match; the behaviour of CQP
in such cases is unspecified
- attribute values can be normalised with the flags %c (to lowercase) and
%d (remove diacritics); the command below uses Unix shell commands to
compute the same frequency distribution as count A by word %c; in a much
more efficient manner
> tabulate A match .. matchend word %c > "| sort | uniq -c | sort -nr";
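- the same approach extends to cases that neither count nor group can handle,
e.g. joint frequencies of the normalised two-token strings immediately
preceding and following each match (a sketch)
> tabulate A match[-2]..match[-1] word %c, matchend[1]..matchend[2] word %c > "| sort | uniq -c | sort -nr";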
- note that in contrast to the behaviour of sort and count,
a range is considered empty when the end point lies before the start point;
such a range will always be printed as an empty string