- for many applications it is important to compute frequency tables for
the matching strings, tokens in the immediate context, attribute values at
different anchor points, different attributes for the same anchor, or
various combinations thereof
- frequency tables for the matching strings, optionally normalised to
lowercase and extended or reduced by an offset, can easily be computed with
the count command (cf. Sections 2.9 and
3.3); when pretty-printing is deactivated (cf. Section 7.1), its output has the form
frequency TAB first line TAB string (type)
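- for illustration, a frequency table of the lowercased match strings of a
named query result A might be computed as follows (a sketch; pretty-printing
can be switched off beforehand, e.g. with set PrettyPrint off; cf. Section 7.1)
> count A by word %c;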
- advantages of the count command:
- strings of arbitrary length can be counted
- frequency counts can be based on normalised strings (%c and %d flags)
- the instances (tokens) for a given string type can easily be identified:
since the underlying query result is automatically sorted by the count
command, these instances appear as a contiguous block starting at the match
number given by first line
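- for example, if the count output lists a type with frequency 16 starting
at match number 120 (hypothetical values), the corresponding instances can
be displayed as a block with
> cat A 120 135;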
- an alternative solution is the group command (see
Sec. 3.4), which computes frequency distributions over
single tokens (i.e. attribute values at a given anchor position) or pairs
of tokens (recall the counter-intuitive command syntax for this case); when
pretty-printing is deactivated, its output has the form
attribute value TAB attribute value TAB frequency
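- for illustration, a joint frequency table of the lemmas at the match and
matchend anchors might be computed along these lines (a sketch, assuming a
lemma attribute; recall the counter-intuitive order of the two keys)
> group A matchend lemma by match lemma;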
- advantages of the group command:
- can compute joint frequencies for non-adjacent tokens
- faster when there are relatively few different types to be counted
- supports frequency distributions for the values of s-attributes
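- for instance, assuming the DICKENS corpus used further below, with
novel_title as an s-attribute with annotated values, the distribution of
matches across novels might be obtained with
> group A match novel_title;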
- the advantages of group and count are for the most part complementary
(e.g. it is not possible to normalise the values of s-attributes, or to
compute joint frequencies of two non-adjacent multi-token strings); in
addition, they have some common weaknesses, such as relatively slow
execution, no options for filtering and pooling data, and limitations on the
types of frequency distributions that can be computed (only simple joint
frequencies, no nested groupings)
- new in CQP v3.4.9: the group command has been re-implemented with a
hash-based algorithm and is now very fast even for large frequency tables;
the other limitations still apply, though
- therefore, it is often necessary (and usually more efficient) to
generate frequency tables with external programs such as dedicated software
for statistical computing or a relational database; these tools need a
data table as input, which lists the relevant feature values (at
specified anchor positions) and/or multi-token strings for each match in the
query result; such tables can often be created from the output of
cat (using suitable PrintOptions, Context and
show settings)
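- such a cat-based data table might be prepared roughly as follows (an
illustrative sketch; the appropriate settings depend on the desired columns)
> set Context 1 s;
> show +lemma;
> cat A > "matches.txt";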
- this procedure involves a considerable amount of re-formatting (e.g. with Unix command-line tools or Perl scripts) and can easily break when
there are unusual attribute values in the data; both cat output and
the re-formatting operations are expensive, making this solution inefficient
when there is a large number of matches
- in most situations, the tabulate command provides a more
convenient, more robust and faster solution; the general form is
> tabulate A column spec, column spec, ... ;
this will print a TAB-separated table where each row corresponds to
one match of the query result A and the columns are described by
one or more column spec(ification)s
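- a minimal concrete example (a sketch, assuming positional attributes word
and pos) would be
> tabulate A match, match word, match pos;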
- just as with dump and cat, the table can be restricted
to a contiguous range of matches, and the output can be redirected to a file
or pipe
> tabulate A 100 119 column spec, column spec, ... ;
> tabulate A column spec, column spec, ... > "data.tbl";
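- for instance, the first 20 matches might be written to a data file with
(an illustrative sketch, assuming a word attribute)
> tabulate A 0 19 match word, matchend word > "sample.tbl";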
- each column specification consists of a single anchor (with optional
offset) or a range between two anchors, using the same syntax as
sort and count; without an attribute name, this
will print the corpus positions for the selected anchor, so
> tabulate A match, matchend, target, keyword;
produces exactly the same output as dump A, provided that target and keyword
anchors are defined for the query result A; otherwise, it will print an
error message (and you need to leave out the column specs target and/or
keyword)
- when an attribute name is given after the anchor, the values of this
attribute for the selected anchor point will be printed; both positional
and structural attributes with annotated values can be used; the following
example prints a table of novel title, book number and chapter title for a
query result from the DICKENS corpus
> tabulate A match novel_title, match book_num, match chapter_title;
note that undefined values (for the book_num and chapter_title attributes)
are represented in the tabulation output by the empty string
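- anchors may also carry offsets in combination with an attribute name, e.g.
to print the word form of the token immediately preceding each match,
together with its corpus position (a sketch)
> tabulate A match[-1], match[-1] word;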
- if an anchor point is undefined or falls outside the corpus (because of
an offset), tabulate prints an empty string or the corpus position
-1 (correct behaviour implemented in v3.4.10)
- a range between two anchor points prints the values of the selected
attribute for all tokens in the specified range; usually, this only makes
sense for positional attributes; the following example prints the
lemma values for 5 tokens before and after each match;
this data can be used to identify collocates of the items matched by the query
> tabulate A match[-5]..match[-1] lemma, matchend[1]..matchend[5] lemma;
the attribute values for tokens within each range are separated by
blanks rather than TABs, in order to avoid ambiguities in the
resulting data table
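- such a table can then be post-processed outside CQP; for instance, a rough
frequency list of right-neighbour lemmas (a first step towards collocate
identification) might be obtained with a shell pipeline like the one shown
further below
> tabulate A matchend[1] lemma > "| sort | uniq -c | sort -nr";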
- any items in an anchor-point range that fall outside the bounds of the
corpus are printed as empty strings or corpus positions -1;
if either the start or end of the range is an undefined anchor, a single
empty string or cpos -1 is printed for the entire range
(correct behaviour implemented in v3.4.10)
- the end position of a range must not be smaller than its start position,
so take care to order items properly and specify sensible offsets; in
particular, a range specification such as match .. target must not be used
if the target anchor might be to the left of the match; the behaviour of CQP
in such cases is unspecified
- attribute values can be normalised with the flags %c (to lowercase) and
%d (remove diacritics); the command below uses Unix shell commands to
compute the same frequency distribution as count A by word %c; in a much
more efficient manner
> tabulate A match .. matchend word %c > "| sort | uniq -c | sort -nr";
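- the same approach extends to cases that neither count nor group can handle,
e.g. joint frequencies of the normalised two-token strings immediately
preceding and following each match (a sketch)
> tabulate A match[-2]..match[-1] word %c, matchend[1]..matchend[2] word %c > "| sort | uniq -c | sort -nr";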
- note that in contrast to the behaviour of sort and count,
a range is considered empty when the end point lies before the start point;
such a range will always be printed as an empty string