3.4 Frequency distributions

show frequency distribution of tokens (or their annotations) at anchor points
> group Go matchend pos;
set cutoff threshold with cut option to reduce size of frequency table
> NP = [pos="DT"] @[pos="JJ"]? [pos="NNS?"];
> group NP target lemma cut 50;
add optional offset to anchor point, e.g. distribution of words preceding matches
> group NP match[-1] lemma cut 100;
frequencies of token/annotation pairs (using different attributes or anchor points)
> group NP matchend word by target lemma;
> group Go matchend lemma by matchend pos;
Despite what the command syntax and output format suggest, results are sorted by pair frequencies (not grouped by the second item). The order of the two items in the output is opposite to the order in the group command.
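The sorting behaviour described above can be pictured with a small Python model. This is only a sketch of the described semantics, not CQP's implementation: the function group_pairs and the sample values are hypothetical, and the alphabetical tie-breaking is an assumption made to keep the output deterministic.

```python
from collections import Counter

def group_pairs(pairs):
    """Model of `group A <first> by <second>` as described in the text:
    one flat table sorted by pair frequency, with the `by` item printed first."""
    counts = Counter(pairs)
    # Output order is reversed relative to the command: (second, first, frequency)
    rows = [(second, first, n) for (first, second), n in counts.items()]
    # Sort by descending pair frequency; alphabetical tie-breaking is an assumption
    rows.sort(key=lambda row: (-row[2], row[0], row[1]))
    return rows

# Hypothetical (matchend word, target lemma) values for six NP matches
pairs = [
    ("men", "old"), ("man", "young"), ("men", "old"),
    ("man", "young"), ("man", "old"), ("days", "old"),
]
table = group_pairs(pairs)
for by_item, first_item, freq in table:
    print(f"{by_item}\t{first_item}\t{freq}")
```

In this toy table the rows for "old" are not contiguous, because the ordering depends only on the pair frequency and not on the second item.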
you can write the output of the group command to a text file (or pipe)
> group NP target lemma cut 10 > "adjectives.go";
(in CQP v3.4.11 and newer, the file is automatically compressed if it ends in .gz or .bz2; see Sec. 3.1 above)
new in CQP v3.4.9: use group by instead of by for nested frequency counts
> group Go matchend lemma group by matchend pos;
where an optional cut clause applies to the individual pairs
new in CQP v3.4.26: Compute document frequencies based on s-attribute regions rather than token frequencies by adding the within keyword (before cut). The example below counts the number of novels in which each distinct lemma occurs in the go and X construction rather than its overall frequency.
> group Go matchend lemma within novel cut 3;
Any items outside regions of the selected s-attribute are silently discarded in the frequency counts. The same happens for undefined anchor points in a simple grouping, because they cannot be assigned to any region. Notice that the top entry (none) is no longer present in the paragraph-frequency count below.
> group NP target lemma within p cut 50;
The second example counts head nouns in chapter and novel titles, silently discarding all other occurrences. Keep in mind that repetitions within the same title will be counted only once; add a within constraint to the initial CQP query if you want a token frequency count within titles.
> group NP matchend lemma within title cut 5;
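The difference between token frequencies and document frequencies can be illustrated with a short Python model. It is a sketch of the behaviour described above under simplifying assumptions, not CQP's implementation: each anchor is represented by a hypothetical (lemma, region) pair, where a lemma of None stands for an undefined anchor and a region of None for a position outside any region of the chosen s-attribute.

```python
from collections import Counter

def token_frequency(anchors):
    """Plain `group`: every anchor counts; undefined anchors appear as (none)."""
    return Counter(lemma if lemma is not None else "(none)" for lemma, _ in anchors)

def document_frequency(anchors):
    """`group ... within <s-attribute>`: count distinct regions per lemma."""
    seen = set()
    for lemma, region in anchors:
        # Undefined anchors and anchors outside all regions are dropped silently
        if lemma is None or region is None:
            continue
        seen.add((lemma, region))  # repetitions within one region collapse here
    return Counter(lemma for lemma, _ in seen)

# Hypothetical anchors: (lemma, number of the enclosing region)
anchors = [
    ("old", 0), ("old", 0), (None, 0), ("little", 1),
    ("old", 1), ("old", None), (None, 2),
]
tf = token_frequency(anchors)
df = document_frequency(anchors)
print(dict(tf))
print(dict(df))
```

Here "old" has a token frequency of 4 but a document frequency of 2, and the (none) entry is present only in the token count.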
In the case of a group ... by, both elements must be contained in the same s-attribute region; otherwise the pair is silently discarded. It is valid for one of the anchors to be undefined, so the output of the commands below still includes (none) entries for NPs without an adjective:
> group NP target lemma group by matchend lemma within novel cut 10;
> group NP matchend lemma by target lemma within novel cut 10;
Computation of document frequencies is only possible if the s-attribute regions are traversed in corpus order by the query result. This will usually be the case and is guaranteed for anchors set in a CQP query with a matching within constraint. However, a set target operation with a large search context can sometimes result in out-of-order anchors. In this case, the frequency count will abort with an error message.
> set NP keyword nearest [pos="JJ.*"] within s;
> group NP keyword lemma within np; # keyword anchors traverse NPs out of order
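One way to picture the corpus-order requirement is the following Python check. It is a hedged sketch only: the function regions_in_corpus_order and the region numbers are hypothetical, and the actual test performed by CQP may differ. The input lists the number of the enclosing region for each anchor, in the order of the query result.

```python
def regions_in_corpus_order(anchor_regions):
    """Return True if region numbers never decrease along the query result.

    Anchors without an enclosing region (None) are skipped, assuming they
    are discarded before counting.
    """
    defined = [r for r in anchor_regions if r is not None]
    return all(a <= b for a, b in zip(defined, defined[1:]))

# Hypothetical region numbers for anchors set by the query itself
matchend_regions = [0, 0, 1, None, 2]
# Hypothetical region numbers after a set ... nearest operation with a wide context
keyword_regions = [0, 1, 0, 2]

print(regions_in_corpus_order(matchend_regions))
print(regions_in_corpus_order(keyword_regions))
```

In the second list the anchor of the third match falls back into an earlier region, which corresponds to the situation in which the frequency count is said to abort.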