6.3 Subqueries

queries can be limited to the matching regions of a previous query ($\Rightarrow$ subqueries)
activate named query instead of system corpus (here: sentences containing interest)
DICKENS> First = [lemma = "interest"] expand to s;
DICKENS> First;
DICKENS:First[624]>
NB: matches of the activated query must be non-overlapping¹⁵
the matches of the named query First now define a virtual structural attribute on the corpus DICKENS with the special name match
all following queries are evaluated with an implicit within match clause
(an additional explicit within clause may be specified as well)
re-activate system corpus to exit subquery mode
DICKENS:First[624]> DICKENS;
DICKENS>
XML tag notation can also be used for the temporary match regions
> <match> [pos = "W.*"];
to find tokens matching the given pattern at the start of an activated region
if target/keyword anchors are set in the activated query result, corresponding XML tags (<target>, <keyword>, ...) can be used, too
> </target> []* </match>;
$\to$ range from the target anchor to end of match, but excluding target
<target> and <keyword> regions always have length 1!
a subquery that starts with an anchor tag can be evaluated very efficiently
appending the keep operator ! turns the subquery into a filter, i.e. it returns all ranges from the activated query result that contain a match of the subquery (equivalent to an implicit expand to match)
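As a minimal sketch of the keep operator in this setting, the session below reuses the activated query First from above and filters it for sentences in which interest is directly preceded by an adjective. The query name Adj and the adjective pattern are illustrative assumptions, not part of the original example.

```
> First = [lemma = "interest"] expand to s;
> First;
DICKENS:First[624]> Adj = [pos = "JJ.*"] [lemma = "interest"] !;
DICKENS:First[624]> cat Adj;
```

Because of the trailing !, Adj should contain entire sentences from First rather than the two-token matches of the subquery itself.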

Subqueries can serve a range of different purposes, especially for advanced users. The examples below illustrate three typical applications.

Searching a subcorpus

select entire texts (or suitable sub-text regions) based on metadata annotation to define a subcorpus, making sure to expand matches appropriately
> HardTimes = <novel_title = "Hard Times"> [] expand to novel;
combine multiple queries with set operators (union, diff, intersect) for complex metadata restrictions
after activating the named query, all following queries will be restricted to the subcorpus
> HardTimes;
DICKENS:HardTimes[1]> [lemma = "hard"];
we can also define a subcorpus by content, e.g. all paragraphs that mention horses
> HorseSubcorpus = [lemma = "horse"] expand to p;
> HorseSubcorpus;
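The set operators mentioned above can be sketched as follows: two metadata-based selections are merged with union, and the result is activated as a subcorpus. The second title ("Oliver Twist") and the names OliverTwist and TwoNovels are assumptions made for illustration; substitute a title that actually occurs in the novel_title annotation of your corpus.

```
> HardTimes = <novel_title = "Hard Times"> [] expand to novel;
> OliverTwist = <novel_title = "Oliver Twist"> [] expand to novel;
> TwoNovels = union HardTimes OliverTwist;
> TwoNovels;
```

Since the two selections cover different novels, the combined matches do not overlap, which is what an activated query requires.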

Iterative refinement of queries

start with a fairly general query, e.g. for a prepositional phrase with a particular head noun
> A = [pos = "IN"] [pos != "[NP].*"]{0,6} [lemma = "dog"] within s;
> cat A;
use subqueries as filters (i.e. with the keep operator !) to apply further constraints to the matches; this is often easier than working all constraints into the original query
e.g. limit to PPs containing an adjective
> A;
DICKENS:A[127]> B = [pos = "JJ.*"] !;
DICKENS:A[127]> cat B;
activate new query result B to apply further filters; this can also be used to exclude false positives (FP) from the matches
e.g. remove false positives that contain punctuation (by specifying the Unicode property escape sequence \pP¹⁶, as a full or partial token) or that begin with that or as
DICKENS:A[127]> B;
DICKENS:B[35]> FP = <match> "that|as"%c | ".*\pP.*" !;
DICKENS:B[35]> C = diff B FP;
DICKENS:B[35]> cat C;

Pre-filtering complex queries

a well-known deficit of CQP is that complex queries with a small result set may still run very slowly on large corpora if highly specific constraints appear only near the end of the query; this is exacerbated by many optional elements at the start of the query
a typical example is searching a noun phrase with a specific head noun, e.g.
> set Timing on;
> Horses = [pos="DT"]? ([pos="RB"]? [pos="JJ.*"])* [lemma="horse"];
since (correct) matches must occur within sentences, we can speed up the search by restricting it to sentences that contain the lemma horse
> Cand = [lemma = "horse"] expand to s;
> Cand;
DICKENS:Cand[545]> H2 = [pos="DT"]? ([pos="RB"]? [pos="JJ.*"])* [lemma="horse"];
in this example, the pre-filtered query should run 10 to 15 times faster than the original one
you may want to verify that both have exactly the same results:
> diff Horses H2;
> diff H2 Horses;
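If the printed differences are long, one possible alternative is to store them as named queries and check only their sizes. This is a sketch that assumes the size command is available in your CQP version; OnlyHorses and OnlyH2 are hypothetical names.

```
> OnlyHorses = diff Horses H2;
> OnlyH2 = diff H2 Horses;
> size OnlyHorses;
> size OnlyH2;
```

Both sizes should be 0 if the two queries return exactly the same matches.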