- show frequency distribution of tokens (or their annotations) at anchor points
> group Go matchend pos;
- set a cutoff threshold with the cut option to reduce the size of the frequency table
> NP = [pos="DT"] @[pos="JJ"]? [pos="NNS?"];
> group NP target lemma cut 50;
- add optional offset to anchor point, e.g. distribution of words
preceding matches
> group NP match[-1] lemma cut 100;
- frequencies of token/annotation pairs (using different attributes or anchor points)
> group NP matchend word by target lemma;
> group Go matchend lemma by matchend pos;
Despite what the command syntax and output format suggest, results are
sorted by pair frequencies (not grouped by the second item).
The order of the two items in the output is opposite to the order in the
group command.
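The pair counting described above can be sketched in Python. This is only an
illustration of the counting and sorting behaviour, not CQP's implementation:
the function name group_pairs, the sample data, the alphabetical tie-breaking
and the reading of cut as "keep pairs with frequency >= cut" are all
assumptions made for the example.

```python
from collections import Counter

def group_pairs(matches, cut=1):
    """Count pairs as in `group A <first> by <second>` (illustrative only).

    `matches` holds one (first, second) tuple per query match.
    Each returned row is (second, first, freq): the `by` item is listed
    first, i.e. the reverse of the order in the group command.
    """
    counts = Counter(matches)
    # Assumption: cut keeps pairs whose frequency is at least `cut`.
    rows = [(second, first, n)
            for (first, second), n in counts.items() if n >= cut]
    # Sort by pair frequency; rows are NOT grouped by the second item.
    # Alphabetical tie-breaking is an assumption of this sketch.
    rows.sort(key=lambda r: (-r[2], r[0], r[1]))
    return rows

# Hypothetical (matchend word, target lemma) pairs for four NP matches.
pairs = [("houses", "old"), ("house", "old"),
         ("houses", "old"), ("man", "young")]
rows = group_pairs(pairs)
```

With this sample, the pair for "old" and "houses" (frequency 2) is listed
first, and the two remaining rows are ordered by their pair frequency alone.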
- you can write the output of the group command to a text file
(or pipe)
> group NP target lemma cut 10 > "adjectives.go";
(in CQP v3.4.11 and newer, the file is automatically compressed if its name
ends in .gz or .bz2; see Sec. 3.1 above)
- new in CQP v3.4.9: use group by instead of by for
nested frequency counts
> group Go matchend lemma group by matchend pos;
where an optional cut clause applies to the individual pairs
- new in CQP v3.4.26: Compute document frequencies based on s-attribute
regions rather than token frequencies by adding the within keyword
(before cut). The example below counts the number of novels in
which each distinct lemma occurs in the go and X construction rather
than its overall frequency.
> group Go matchend lemma within novel cut 3;
- Any items outside regions of the selected s-attribute are silently
discarded in the frequency counts. The same happens for undefined anchor
points in a simple grouping, because they cannot be assigned to any region.
Notice that the top entry (none) is no longer present in the
paragraph-frequency count below.
> group NP target lemma within p cut 50;
The second example counts head nouns in chapter and novel titles, silently
discarding all other occurrences. Keep in mind that repetitions within the
same title will be counted only once; add a within constraint to
the initial CQP query if you want a token frequency count within titles.
> group NP matchend lemma within title cut 5;
- In the case of a group ... by, both elements must be contained in the
same s-attribute region; otherwise the pair is silently discarded. It is
valid for one of the anchors to be undefined, so the output of the commands
below still include (none) entries for NPs without an adjective:
> group NP target lemma group by matchend lemma within novel cut 10;
> group NP matchend lemma by target lemma within novel cut 10;
- Computation of document frequencies is only possible if the s-attribute
regions are traversed in corpus order by the query result. This will usually
be the case and is guaranteed for anchors set in a CQP query with a matching
within constraint. However, a set target operation with a
large search context can sometimes result in out-of-order anchors.
In this case, the frequency count will abort with an error message.
> set NP keyword nearest [pos="JJ.*"] within s;
> group NP keyword lemma within np;
# keyword anchors traverse NPs out of order
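The difference between token frequencies and document frequencies can be
sketched in Python. This is a simplified model of the behaviour described
above, not CQP's actual algorithm: the function name document_frequency, the
representation of regions as integer indices in corpus order (None for items
outside any region), and the reading of cut as "keep items with frequency
>= cut" are assumptions made for the example.

```python
from collections import defaultdict

def document_frequency(items, cut=1):
    """Count in how many regions each value occurs (illustrative only).

    `items` yields (value, region) pairs in query-result order, where
    `region` is the index of the enclosing s-attribute region, or None
    if the anchor is undefined or lies outside all regions.
    """
    regions_seen = defaultdict(set)
    last_region = -1
    for value, region in items:
        if region is None:
            continue  # silently discarded, as described above
        if region < last_region:
            # Mirrors the documented abort for out-of-order anchors.
            raise ValueError("regions are not traversed in corpus order")
        last_region = region
        regions_seen[value].add(region)  # repeats in a region count once
    rows = [(v, len(r)) for v, r in regions_seen.items() if len(r) >= cut]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows

# Hypothetical lemmas with the index of the novel in which each occurs.
hits = [("see", 0), ("see", 0), ("try", 0), ("see", 1), ("try", None)]
doc_freq = document_frequency(hits)
```

Here "see" occurs three times but only in two novels, so its document
frequency is 2; the occurrence of "try" outside any region is dropped, which
leaves it with a document frequency of 1.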