3.3 Anchor points

the result of a (complex) query is a list of token sequences of variable length ( $\Rightarrow$ matches)
each match is represented by two anchor points:
match (corpus position of first token) and matchend (corpus position of last token)
set an additional target anchor with @ marker in query (prepended to a pattern)
> "in" @[pos="DT"] [lemma="case"];
$\to$ shown in bold font in KWIC display
only a single token can be marked as target; if multiple @ markers are used (or if the marker is in the scope of a repetition operator such as +), only the rightmost matching token⁶ will be marked
> [pos="DT"] (@[pos="JJ.*"] ","?){2,} [pos="NNS?"];
when targeted pattern is optional, check how many matches have target anchor set
> A = [pos="DT"] @[pos="JJ"]? [pos="NNS?"];
> size A;
> size A target;
new in CQP v3.4.16: A second anchor position called keyword can also be set. The default notation is @1, but this can be changed with a user option (see Sec. 8.6 for details).
> "in" @[pos="DT"] @1[pos="J.*"]? [lemma="case"];
$\to$ keyword is underlined in KWIC display
each token pattern in a query can only be marked with one of the two anchors
anchor points allow a flexible specification of sort keys with the general form
> sort by attribute on start point .. end point ;
both start point and end point are specified as an anchor, plus an optional offset in square brackets; for instance, match[-1] refers to the token before the start of the match, matchend to the last token of the match, matchend[1] to the first token after the match, and target[-2] to a position two tokens before (i.e. to the left of) the target anchor
NB: the target anchor should only be used in the sort key when it is always defined
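The anchor-plus-offset notation can be modelled outside CQP as simple arithmetic on corpus positions. The following Python sketch is only an illustration: the function `resolve` and the dictionary layout are hypothetical and not part of CQP, and no check against the corpus boundaries is made.

```python
import re

# Pattern for specifications such as "match", "matchend[1]" or "target[-2]".
SPEC = re.compile(r"(matchend|match|target|keyword)(?:\[(-?\d+)\])?")

def resolve(spec, record):
    """Return the corpus position named by an anchor specification.

    `record` maps anchor names to corpus positions, using -1 for an
    anchor that has not been set. Returns None for an unset anchor.
    """
    m = SPEC.fullmatch(spec)
    if m is None:
        raise ValueError(f"not an anchor specification: {spec!r}")
    base = record[m.group(1)]
    if base < 0:
        return None  # anchor undefined, so no position can be computed
    offset = int(m.group(2)) if m.group(2) else 0
    return base + offset

# One match with a target anchor but no keyword anchor (hypothetical values).
record = {"match": 1924977, "matchend": 1924979, "target": 1924978, "keyword": -1}

print(resolve("match[-1]", record))    # token before the start of the match
print(resolve("matchend[1]", record))  # first token after the match
print(resolve("target[-2]", record))   # two tokens before the target
print(resolve("keyword", record))      # None: anchor not set
```

The `None` result for an unset anchor mirrors the reason for the NB above: a sort key built on an undefined anchor has no position to start from.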
example: sort noun phrases by adjectives between determiner and noun
> [pos="DT"] [pos="JJ"]{2,} [pos="NNS?"];
> sort by word %cd on match[1] .. matchend[-1];
if end point refers to a corpus position before start point, the tokens in the sort keys are compared from right to left; e.g. sort on the left context of the match by token:
> sort by word %cd on match[-1] .. match[-42];
whereas the reverse option sorts on the left context by character:
> sort by word %cd on match[-42] .. match[-1] reverse;
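The difference between the two orderings can be illustrated with a small Python sketch. It is a simplified model rather than CQP's actual implementation: it assumes that tokens in a sort key are joined with a single space, and it ignores the %cd flags. The function names and the toy contexts are made up for the example.

```python
def key_right_to_left(tokens):
    """Model of `match[-1] .. match[-n]`: whole tokens, nearest to the match first."""
    return list(reversed(tokens))

def key_reversed_chars(tokens):
    """Model of `match[-n] .. match[-1] reverse`: the joined string read backwards."""
    return " ".join(tokens)[::-1]

# Toy left contexts, each given in corpus order (hypothetical data).
left_contexts = [
    ["the", "old", "house"],
    ["a", "cold", "mouse"],
    ["the", "new", "house"],
]

by_token = sorted(left_contexts, key=key_right_to_left)
by_char = sorted(left_contexts, key=key_reversed_chars)

print(by_token)  # "new" sorts before "old": tokens are compared as words
print(by_char)   # "old" sorts before "new": "dlo" precedes "wen"
```

In the token-wise ordering the second comparison is between the words "new" and "old", whereas in the character-wise ordering it is between their reversed spellings, so the two commands can rank the same contexts differently.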
complex sort operations can sometimes be sped up by using an external helper program (on Unix, the standard sort tool)⁷
> sort by word %cd;
> set ExternalSort on;
> sort by word %cd;
> set ExternalSort off;
the count command accepts the same specification for the strings to be counted
> count by lemma on match[1] .. matchend[-1];
display corpus positions of all anchor points in tabular format
> A = "behind" @[pos="JJ"]? [pos="NNS?"];
> dump A;
> dump A 9 14; $\quad$ (10$^{\text{th}}$ to 15$^{\text{th}}$ match)
the four columns correspond to the match, matchend, target and keyword (see Section 3.7) anchors; a value of -1 means that the anchor has not been set:
```
        1019887 1019888 -1      -1
        1924977 1924979 1924978 -1
        1986623 1986624 -1      -1
        2086708 2086710 2086709 -1
        2087618 2087619 -1      -1
        2122565 2122566 -1      -1
```
note that any prior sort or count command affects the ordering of the rows (so that the $n$-th row corresponds to the $n$-th line in a KWIC display obtained with cat)
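Because the dump table is plain whitespace-separated text, it is easy to process with external tools. The Python sketch below parses the six rows shown above and counts the matches whose target anchor is set; the function name `parse_dump` is hypothetical, and the sketch assumes exactly four integer columns per row.

```python
def parse_dump(text):
    """Parse dump output into (match, matchend, target, keyword) integer tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        rows.append(tuple(int(f) for f in fields))
    return rows

# The rows from the table above, as they would appear in a saved dump.
DUMP = """\
        1019887 1019888 -1      -1
        1924977 1924979 1924978 -1
        1986623 1986624 -1      -1
        2086708 2086710 2086709 -1
        2087618 2087619 -1      -1
        2122565 2122566 -1      -1
"""

rows = parse_dump(DUMP)
with_target = [r for r in rows if r[2] != -1]  # -1 marks an unset anchor

print(len(rows), "matches,", len(with_target), "with target anchor set")
```

For these rows, only the two three-token matches contain the optional adjective and therefore carry a target anchor.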
the output of a dump command can be written (>) or appended (>>) to a file; if the first character of the filename is |, the output is sent to a pipe consisting of the command(s) that follow the |
pipe commands can only use programs that are (a) available on your operating system, and (b) accessible via your environment's PATH (or named with their full filesystem location); Linux, Mac OS, WSL⁸ etc. have the standard set of Unix tools, including the sort/gawk/uniq programs used in the trick discussed below, whereas Windows does not unless you take special measures to install them; commands involving pipes to programs you don't have will, of course, fail
use the following trick to display the distribution of match lengths in a named query result A:
> A = [pos="DT"] [pos="JJ.*"]* [pos="NNS?"];
> dump A > "| gawk '{print $2 - $1 + 1}' | sort -nr | uniq -c | less";
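Where the Unix tools are not available, the same distribution can be computed from the dump table in a few lines of Python. This is only a sketch of an equivalent calculation: the function name `length_distribution` is hypothetical, and the rows are the (match, matchend) pairs of the six-row dump table shown earlier in this section, not the result of the query above.

```python
from collections import Counter

def length_distribution(rows):
    """Count match lengths (matchend - match + 1), longest first.

    Returns (length, count) pairs, i.e. the same information as the
    `gawk | sort -nr | uniq -c` pipeline, with the columns swapped.
    """
    counts = Counter(end - start + 1 for start, end in rows)
    return sorted(counts.items(), reverse=True)

# (match, matchend) pairs taken from the dump table in this section.
rows = [
    (1019887, 1019888),
    (1924977, 1924979),
    (1986623, 1986624),
    (2086708, 2086710),
    (2087618, 2087619),
    (2122565, 2122566),
]

for length, count in length_distribution(rows):
    print(f"{count:7d} {length}")
```

The dump could first be written to a file with the > redirection described above and then read line by line, keeping only the first two columns.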
see Section 7.2 for an opposite to the dump command, which may be useful for certain tasks such as locating a specific corpus position