6.3 Subqueries
- queries can be limited to the matching regions of a previous query
(subqueries)
- activate named query instead of system corpus (here: sentences
containing interest)
DICKENS> First = [lemma = "interest"] expand to s;
DICKENS> First;
DICKENS:First[624]>
NB: matches of the activated query must be non-overlapping [footnote 15]
- the matches of the named query First now define a
virtual structural attribute on the corpus DICKENS with the
special name match
- all following queries are evaluated with an implicit
within match clause
(an additional explicit within clause may be specified as well)
- re-activate system corpus to exit subquery mode
DICKENS:First[624]> DICKENS;
DICKENS>
- XML tag notation can also be used for the temporary match
regions
> <match> [pos = "W.*"];
to find tokens matching the given pattern at the start of an
activated region
- if target/keyword anchors are set in the activated
query result, corresponding XML tags (<target>,
<keyword>, ...) can be used, too
> </target> []* </match>;
range from the target anchor to end of match, but excluding
target
note that <target> and <keyword> regions always have length 1
- a subquery that starts with an anchor tag can be evaluated very efficiently
- appending the keep operator ! turns the subquery into
a filter, i.e. it returns all ranges from the activated query result that
contain a match of the subquery (equivalent to an implicit expand to match)
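The difference between a plain subquery and a subquery with the keep operator can be sketched with a small Python model. This is only a toy illustration of the behaviour described above, not CQP's implementation: it handles one-token queries only, and the token list, the region list first and the helper names find_matches and subquery are all hypothetical.

```python
from typing import Callable, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end) corpus positions


def find_matches(tokens: List[str], pred: Callable[[str], bool],
                 start: int, end: int) -> List[Range]:
    """Return one-token matches of pred between start and end (inclusive)."""
    return [(i, i) for i in range(start, end + 1) if pred(tokens[i])]


def subquery(tokens: List[str], regions: List[Range],
             pred: Callable[[str], bool], keep: bool = False) -> List[Range]:
    """Evaluate a one-token query inside each activated region.

    keep=False models the implicit `within match`: only hits inside a
    region are returned.  keep=True models the keep operator `!`: the
    whole region is returned if it contains at least one hit.
    """
    results: List[Range] = []
    for start, end in regions:
        hits = find_matches(tokens, pred, start, end)
        if keep:
            if hits:
                results.append((start, end))
        else:
            results.extend(hits)
    return results


tokens = ["no", "interest", "here", ".",
          "what", "a", "day", ".",
          "who", "has", "interest", "?"]

# Hypothetical activated query result: the two sentences containing "interest".
first = [(0, 3), (8, 11)]


def is_wh(tok: str) -> bool:
    return tok.startswith("wh")


inside = subquery(tokens, first, is_wh)           # plain subquery
kept = subquery(tokens, first, is_wh, keep=True)  # subquery with `!`

print(inside)  # [(8, 8)]
print(kept)    # [(8, 11)]
```

The token "what" at position 4 is not reported because it lies outside the activated regions, and the keep variant returns the complete second sentence rather than the single matching token.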
Subqueries can serve a range of different purposes, especially for advanced users.
The examples below illustrate three typical applications.
Searching a subcorpus
- select entire texts (or suitable sub-text regions) based on metadata
annotation to define a subcorpus, making sure to expand matches appropriately
> HardTimes = <novel_title = "Hard Times"> [] expand to novel;
- combine multiple queries with set operators (union,
diff, intersect) for complex metadata restrictions
- after activating the named query, all following queries will be restricted
to the subcorpus
> HardTimes;
DICKENS:HardTimes[1]> [lemma = "hard"];
- we can also define a subcorpus by content, e.g. all paragraphs that
mention horses
> HorseSubcorpus = [lemma = "horse"] expand to p;
> HorseSubcorpus;
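How expanding matches and combining named queries with set operators interact can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not CQP itself: matches are plain (start, end) pairs, duplicates produced by the expansion are merged, and set operators compare exact ranges. The paragraph boundaries, the hit lists and the helper expand_to are invented for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, end) corpus positions


def expand_to(matches: List[Range], regions: List[Range]) -> List[Range]:
    """Replace each match by the region containing it, merging duplicates."""
    expanded = set()
    for m_start, m_end in matches:
        for r_start, r_end in regions:
            if r_start <= m_start and m_end <= r_end:
                expanded.add((r_start, r_end))
                break
    return sorted(expanded)


# Hypothetical paragraph regions and token-level hits of two queries.
paragraphs = [(0, 9), (10, 19), (20, 29)]
horse_hits = [(3, 3), (7, 7), (25, 25)]
cart_hits = [(12, 12), (27, 27)]

horse_pars = expand_to(horse_hits, paragraphs)
cart_pars = expand_to(cart_hits, paragraphs)

# Set operators on the expanded results (modelled with Python sets).
union = sorted(set(horse_pars) | set(cart_pars))
intersect = sorted(set(horse_pars) & set(cart_pars))
diff = sorted(set(horse_pars) - set(cart_pars))

print(horse_pars)  # [(0, 9), (20, 29)]
print(intersect)   # [(20, 29)]
print(diff)        # [(0, 9)]
```

The sketch also shows why expanding matters before combining: the raw hits of the two queries never coincide, so their intersection would be empty, whereas the expanded paragraph ranges can be compared directly.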
Iterative refinement of queries
- start with a fairly general query, e.g. for a prepositional phrase
with a particular head noun
> A = [pos = "IN"] [pos != "[NP].*"]{0,6} [lemma = "dog"] within s;
> cat A;
- use subqueries as filters (i.e. with the keep operator
!) to apply further constraints to the matches; this is often
easier than working all constraints into the original query
- e.g. limit to PPs containing an adjective
> A;
DICKENS:A[127]> B = [pos = "JJ.*"] !;
DICKENS:A[127]> cat B;
- activate new query result B to apply further filters; this can also
be used to exclude false positives (FP) from the matches
- e.g. remove false positives that contain punctuation (by specifying the
Unicode property escape sequence \pP [footnote 16] as a full or partial
token) or that begin with that or as
DICKENS:A[127]> B;
DICKENS:B[35]> FP = <match> "that|as"%c | ".*\pP.*" !;
DICKENS:B[35]> C = diff B FP;
DICKENS:B[35]> cat C;
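The false-positive filter can be mimicked in Python to make its logic explicit. This is a rough sketch, not a reimplementation of CQP: the standard re module has no \pP escape, so punctuation is detected through Unicode general categories instead, and the sample matches in b as well as the helper names are invented for illustration.

```python
import unicodedata
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end) corpus positions


def has_punct(token: str) -> bool:
    """True if any character has a Unicode punctuation category (P*)."""
    return any(unicodedata.category(ch).startswith("P") for ch in token)


def is_false_positive(match_tokens: List[str]) -> bool:
    """Model of the FP query: the match begins with that/as (ignoring
    case, like %c) or contains a token with a punctuation character."""
    starts_badly = match_tokens[0].lower() in {"that", "as"}
    return starts_badly or any(has_punct(tok) for tok in match_tokens)


# Hypothetical contents of query result B: range -> tokens of the match.
b: Dict[Range, List[str]] = {
    (10, 13): ["with", "a", "large", "dog"],
    (40, 44): ["as", "a", "faithful", "old", "dog"],
    (70, 74): ["of", "his", "so-called", "lazy", "dog"],
    (90, 93): ["That", "poor", "little", "dog"],
}

fp = [rng for rng, toks in b.items() if is_false_positive(toks)]
c = [rng for rng in b if rng not in fp]  # corresponds to: C = diff B FP

print(fp)  # [(40, 44), (70, 74), (90, 93)]
print(c)   # [(10, 13)]
```

Note that the hyphen in "so-called" counts as punctuation, so a pattern that matches punctuation as part of a token also removes hyphenated words; whether that is desired depends on the corpus tokenisation.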
Pre-filtering complex queries
- a well-known deficit of CQP is that complex queries with a small result
set may still run very slowly on large corpora if highly specific
constraints appear only near the end of the query; this is exacerbated by
many optional elements at the start of the query
- a typical example is searching a noun phrase with a specific head noun, e.g.
> set Timing on;
> Horses = [pos="DT"]? ([pos="RB"]? [pos="JJ.*"])* [lemma="horse"];
- since (correct) matches must occur within sentences, we can speed up the
search by restricting it to sentences that contain the lemma horse
> Cand = [lemma = "horse"] expand to s;
> Cand;
DICKENS:Cand[545]> H2 = [pos="DT"]? ([pos="RB"]? [pos="JJ.*"])* [lemma="horse"];
- in this example, the pre-filtered query should run 10 to 15 times faster
- you may want to verify that both queries return exactly the same results:
> diff Horses H2;
> diff H2 Horses;
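The reasoning behind pre-filtering and the two-way diff check can be sketched in Python. The model below is deliberately simplistic and rests on several assumptions: the sentences are invented, run_query is a hypothetical stand-in for the noun phrase query (an adjective directly followed by the lemma horse), and the number of start positions tried is only a crude proxy for running time.

```python
from typing import Iterable, List, Set, Tuple

Token = Tuple[str, str]     # (lemma, pos)
Hit = Tuple[int, int, int]  # (sentence id, start, end)


def run_query(sentences: List[List[Token]],
              sentence_ids: Iterable[int]) -> Tuple[Set[Hit], int]:
    """Toy stand-in for the NP query: adjective + lemma 'horse'.

    Returns the set of hits and the number of start positions tried.
    """
    hits: Set[Hit] = set()
    tried = 0
    for sid in sentence_ids:
        sent = sentences[sid]
        for i in range(len(sent) - 1):
            tried += 1
            (_, pos1), (lemma2, _) = sent[i], sent[i + 1]
            if pos1.startswith("JJ") and lemma2 == "horse":
                hits.add((sid, i, i + 1))
    return hits, tried


sentences: List[List[Token]] = [
    [("the", "DT"), ("old", "JJ"), ("horse", "NN"), ("sleep", "VBZ")],
    [("a", "DT"), ("grey", "JJ"), ("morning", "NN")],
    [("it", "PRP"), ("rain", "VBD")],
    [("two", "CD"), ("fine", "JJ"), ("horse", "NNS")],
]

# Unfiltered search over the whole corpus.
horses, tried_all = run_query(sentences, range(len(sentences)))

# Pre-filter: keep only sentences that contain the lemma "horse".
cand = [sid for sid, sent in enumerate(sentences)
        if any(lemma == "horse" for lemma, _ in sent)]
h2, tried_cand = run_query(sentences, cand)

# A one-way difference only shows what the first result has and the
# second lacks, so both directions must be empty for equal results.
only_in_horses = horses - h2
only_in_h2 = h2 - horses

print(tried_all, tried_cand)       # 8 5
print(only_in_horses, only_in_h2)  # set() set()
```

Both differences are empty here because every hit must lie inside a sentence containing horse, which is the same argument that justifies the pre-filter for the real query.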