- new in v3.4.31: CQP now has limited support for ad-hoc
structural annotation, which does not need to be indexed in the
form of an s-attribute.25
- The core element of the new syntax is <<name>>
to match an entire token span, e.g. a region of the s-attribute
name. We will therefore refer to this item as a
region element.26 The two queries below are fully equivalent:
> "dine" [pos = "IN"] <<np>>;
> "dine" [pos = "IN"] <np> []+ </np>;
- The implementation of <<np>> in the first of these queries is almost
exactly the same as that used for the second query: CQP will step
through all tokens of the region until it reaches its end,
so both queries have very similar execution times.
But a key difference is that <<np>> doesn't allow any constraints
on the content of the region or on annotation values of the s-attribute;
it also needs less trickery behind the scenes to match up start and end tags
and is therefore less prone to issues with obscure corner cases of CQP
queries.27
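The behaviour of a region element can be pictured with a small Python model. This is only an illustrative sketch, not CQP's implementation: the function name `region_element_ends` and the region positions are hypothetical, and regions are assumed to be inclusive (start, end) pairs of corpus positions.

```python
def region_element_ends(regions, pos):
    """Return the end positions of all regions that start exactly at pos.

    A region element at corpus position pos can only match if some region
    begins there; it then consumes every token up to the region's end.
    """
    return [end for start, end in regions if start == pos]


# Hypothetical <np> regions as inclusive (start, end) corpus positions.
np_regions = [(2, 4), (7, 7), (9, 12)]

# Suppose "dine" is at position 0 and the preposition at position 1,
# so <<np>> has to start at position 2.
print(region_element_ends(np_regions, 2))  # a region starts here
print(region_element_ends(np_regions, 3))  # inside a region: no match
```

In this model the explicit form `<np> []+ </np>` would yield the same spans, since it also starts at a region boundary and runs to the region's end.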
- Structural annotation can sometimes be generated with a
(usually rather complex) CQP query, e.g. noun phrase chunking with a query such as
> NP = (?longest) [pos = "DT"]?
( [pos = "JJ.*"] ([word=",|and|or" | pos="RB"]* [pos = "JJ.*"])* )?
[pos = "NNS?"]+;
It will often be desirable to reuse such noun chunks in other queries,
and the customary recommendation is to define a suitable macro /np[]
(cf. Sec. 6.4), which can then be integrated into
queries for larger patterns such as “NP about NP”:
> /np[] "about" /np[];
This approach results in extremely complex and slow queries that are difficult
to debug. It also makes it necessary to manage a macro library in order
to use such query fragments across multiple CQP sessions
(with different versions of the macro for different tagsets).
- One possibility – which has already been used to implement YAC, a
cascaded finite-state chunk parser for German (see Sec. 1.1)
– is to dump the query result and index it as a structural
attribute in the corpus with cwb-s-encode.
The query for “NP about NP” can then be executed more efficiently as
> <<np>> "about" <<np>>;
in the new region element syntax. This approach has several drawbacks:
- the end user must have write access to the corpus data directory
and registry file, and a suitable wrapper infrastructure has to be implemented;
- registry entries and show cd; may be cluttered with
a large number of undocumented s-attributes if multiple users
have write access to the corpus;
- CQP has to be restarted in order to recognise the newly indexed s-attributes;
- a query result may contain overlapping and even nested spans,28 which are not allowed in s-attributes; care has to be taken to
discard the problematic matches beforehand.
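The last drawback can be made concrete with a short Python sketch that removes overlapping and nested spans from a list of (match, matchend) pairs before they are encoded. The policy used here (keep the earliest span, preferring the longer one when two spans start at the same position) is an assumption chosen for illustration; the function name `drop_overlaps` and the sample spans are hypothetical.

```python
def drop_overlaps(spans):
    """Keep a non-overlapping subset of inclusive (match, matchend) spans.

    Spans are visited in order of start position (longer span first on ties).
    A span is kept only if it starts after the end of the last span kept.
    """
    kept = []
    last_end = -1
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if start > last_end:
            kept.append((start, end))
            last_end = end
    return kept


# Hypothetical query result with one overlap and one nested span.
spans = [(0, 3), (2, 5), (4, 4), (6, 8), (6, 7)]
print(drop_overlaps(spans))
```

Other policies (for example, always keeping the longest span) are equally possible; which matches count as "problematic" depends on the application.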
- Region elements provide a convenient and powerful alternative:
<<NP>> interprets the named query result NP as structural annotation
and matches any of its match...matchend spans.29 We can thus immediately search for “NP about NP” with the query
> <<NP>> "about" <<NP>>;
Named query results thus provide flexible ad-hoc annotation.
They can directly be used (and updated) as temporary annotation
in a running CQP session, but can also be persisted to disk
(save NP;) and will be loaded on-demand in a new session.
Ad-hoc annotation does not need special file access permissions
and is private to each user, but can also easily be shared with other
users (by copying the saved query results).
- If naming conventions are observed (lowercase for s-attribute names,
CamelCase for named query results), there can be no conflict between
the two uses of region elements. Otherwise, a query result always takes
precedence, since it is the main application of the new feature.
- To search for spans between other anchors in the query result,
make a copy of the named query result (NQR) and use set to modify the matching span.
Assuming that (for instance) a target anchor has been set on the first
noun in NP,30 we would type
> PreNominal = NP;
> set PreNominal matchend target[-1] !;
in order to be able to match the prenominal elements of a noun phrase
with <<PreNominal>>. The set ... !; operation
automatically discards any matches consisting only of nouns,
because then target[-1] precedes the match anchor
(resulting in an invalid span).31
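The effect of moving matchend to target[-1] and discarding invalid spans can be sketched in Python as follows. This is a simplified model under stated assumptions: each match is an inclusive (match, matchend, target) triple of corpus positions, target[-1] is taken to mean the position immediately before the target, and the function name and sample rows are hypothetical.

```python
def set_matchend_before_target(rows):
    """Model of moving matchend to target[-1] and dropping invalid spans.

    rows: (match, matchend, target) triples of corpus positions.
    Returns (match, new_matchend) pairs for the spans that remain valid.
    """
    result = []
    for match, _matchend, target in rows:
        new_end = target - 1  # token immediately before the target
        if new_end >= match:  # otherwise the span would be invalid
            result.append((match, new_end))
    return result


rows = [
    (10, 12, 12),  # e.g. DT JJ NN: target on the third token
    (20, 20, 20),  # e.g. NN only: target coincides with match
    (30, 33, 32),  # e.g. DT JJ NN NN: target on the first noun
]
print(set_matchend_before_target(rows))
```

The second row disappears because its new matchend (19) would precede its match anchor (20), which mirrors the automatic filtering described above.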
- Of course, region elements can be repeated with quantifiers,
included in a complex subexpression, etc. As a simple example,
consider the query
> <<NP>> ( [pos = "IN|TO"] <<NP>> ){5};
- Let us now consider a query result NamedEntity that
contains tentatively identified named entities in the corpus:
> NamedEntity = (?traditional)
<np> []* [pos = "NPS?"] </np> | <np1> []* [pos = "NPS?"] </np1>;
where the traditional matching strategy enables nested matches
for complex named entities such as the [Lord Mayor of
[London]]. We can then find a preposition followed by a tentative
named entity:
> (?traditional) [pos = "IN|TO"] <<NamedEntity>>;
taking all nested and/or overlapping NE candidates into account,
which would not be possible with an s-attribute. Again, the
traditional matching strategy enables nested matches
(e.g. look for the line in the presence of the Ghost of Christmas).
- Such named entity candidates are often generated by an
external annotation tool and would traditionally be encoded
as an s-attribute. In order to allow for overlapping and nested NEs,
they can be undumped into an NQR instead. If there is an additional
categorisation of NEs, multiple NQRs (NEperson, NEloc,
NEorg, ...) can be created for the different classes.
- Labels and target markers can be set on the first and last
token of the range matched by a region element, using the same notation
as for token expressions. Those for the first token are inserted
immediately after the opening <<, those for the last token
immediately before the closing >>. It is recommended (and
sometimes necessary) to separate them from the element name with
whitespace in order to avoid parsing errors.
- The example below uses labels to ensure that the second NP is
at least 3 tokens long and it sets target markers on the end of the
first NP and the start of the second (with the other boundaries
implicitly given by the match and matchend anchors).
> <<NP @0>> "after" <<@1 a: NP b:>> :: distabs(a, b) >= 2;
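The arithmetic behind the length constraint can be checked with a few lines of Python. The sketch assumes that distabs(a, b) is the absolute difference between the two corpus positions and that spans are inclusive; `span_length` is a hypothetical helper introduced only for this illustration.

```python
def distabs(a, b):
    """Absolute distance between two corpus positions (assumed semantics)."""
    return abs(a - b)


def span_length(first, last):
    """Number of tokens in an inclusive span from first to last."""
    return last - first + 1


# Label a on the first token and label b on the last token of the second NP.
for a, b in [(5, 5), (5, 6), (5, 7), (5, 9)]:
    ok = distabs(a, b) >= 2
    print(a, b, span_length(a, b), ok)
```

A span of n tokens has a distance of n - 1 between its first and last token, so a distance of at least 2 corresponds to at least 3 tokens.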
- Zero-width region elements are indicated by a slash
after the element name, e.g. <<NP/>>. They match at the
start of a suitable span, but do not consume any tokens.
Because there might be multiple spans at the same position,
a label or target marker can only be set on the first token of the span.
Such zero-width region elements are of little use for s-attributes:
<<np/>> behaves almost exactly like the XML start tag <np>,
except that it allows a label to be set and does not match up with
a corresponding end tag </np>.
- The main purpose of zero-width region elements is to enable
anchored queries, which take their potential starting point
from a previous query result. This is similar to the earlier strategy
of running a subquery starting with a <match> anchor
(see Sec. 6.3), but much more convenient;
moreover it allows for overlapping matches.
- A query with a highly selective element near the end such as
> "the"%c [pos="JJ.*"]{3,} [lemma="creature"];
can now conveniently be sped up32 by a pre-filtering strategy:
> Cand = MU(meet "the"%c [lemma="creature"] 1 5);
> <<Cand/>> "the"%c [pos="JJ.*"]{3,} [lemma="creature"];
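The pre-filtering idea can be illustrated with a toy Python model. It is a sketch under several assumptions and does not reflect how CQP evaluates queries internally: the corpus is a short list of hypothetical tokens, `candidates` imitates the MU query by collecting positions of "the" that have the lemma "creature" 1 to 5 tokens to the right, and `matches_at` tries the full pattern at a single start position.

```python
def candidates(tokens, lo=1, hi=5):
    """Positions of "the" (case-insensitive) with lemma "creature"
    between lo and hi tokens to the right."""
    out = []
    for i, tok in enumerate(tokens):
        if tok["word"].lower() == "the":
            window = tokens[i + lo : i + hi + 1]
            if any(w["lemma"] == "creature" for w in window):
                out.append(i)
    return out


def matches_at(tokens, i):
    """Try "the" + at least 3 adjectives + lemma "creature" starting at i.
    Returns an inclusive (start, end) span or None."""
    if i >= len(tokens) or tokens[i]["word"].lower() != "the":
        return None
    n = 0  # number of consecutive adjectives after "the"
    while i + 1 + n < len(tokens) and tokens[i + 1 + n]["pos"].startswith("JJ"):
        n += 1
    for k in range(3, n + 1):
        j = i + 1 + k
        if j < len(tokens) and tokens[j]["lemma"] == "creature":
            return (i, j)
    return None


def tok(word, pos, lemma):
    return {"word": word, "pos": pos, "lemma": lemma}


# Hypothetical toy corpus.
tokens = [
    tok("The", "DT", "the"), tok("poor", "JJ", "poor"), tok("old", "JJ", "old"),
    tok("lonely", "JJ", "lonely"), tok("creature", "NN", "creature"),
    tok("saw", "VBD", "see"), tok("the", "DT", "the"),
    tok("creature", "NN", "creature"),
]

full_scan = [m for m in (matches_at(tokens, i) for i in range(len(tokens))) if m]
cand = candidates(tokens)
anchored = [m for m in (matches_at(tokens, i) for i in cand) if m]
print(cand, anchored, full_scan)
```

In this model the anchored search inspects only the candidate positions but finds the same span as the full scan. The pre-filter only preserves matches whose "creature" token lies inside the chosen window, so the window has to be wide enough for the matches of interest.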