8.7 Region elements and ad-hoc annotation

New in v3.4.31: CQP now has limited support for ad-hoc structural annotation, which does not need to be indexed in the form of an s-attribute.²⁵
The core element of the new syntax is <<name>> to match an entire token span, e.g. a region of the s-attribute name. We will therefore refer to this item as a region element.²⁶ The two queries below are fully equivalent:
> "dine" [pos = "IN"] <<np>>;
> "dine" [pos = "IN"] <np> []+ </np>;
The implementation of <<np>> in the first of these queries is almost exactly the same as that used for the second query: CQP will step through all tokens of the region until it reaches its end, so both queries have very similar execution times. But a key difference is that <<np>> doesn't allow any constraints on the content of the region or on annotation values of the s-attribute; it also needs less trickery behind the scenes to match up start and end tags and is therefore less prone to issues with obscure corner cases of CQP queries.²⁷
Structural annotation can sometimes be generated with a (usually rather complex) CQP query, e.g. noun phrase chunking with a query such as
```
> NP = (?longest) [pos = "DT"]? 
       ( [pos = "JJ.*"] ([word=",|and|or" | pos="RB"]* [pos = "JJ.*"])* )? 
       [pos = "NNS?"]+;
```
It will often be desirable to reuse such noun chunks in other queries, and the customary recommendation is to define a suitable macro /np[] (cf. Sec. 6.4), which can then be integrated into queries for larger patterns such as “NP about NP”:
> /np[] "about" /np[];
This approach can result in complex and slow queries that are difficult to debug. It also makes it necessary to manage a macro library in order to use such query fragments across multiple CQP sessions (with different versions of the macro for different tagsets).
One possibility – which has already been used to implement YAC, a cascaded finite-state chunk parser for German (see Sec. 1.1) – is to dump the query result and index it as a structural attribute in the corpus with cwb-s-encode. The query for “NP about NP” can then be executed more efficiently as
> <<np>> "about" <<np>>;
in the new region element syntax. This approach has several drawbacks:
- the end user must have write access to the corpus data directory and registry file, and a suitable wrapper infrastructure has to be implemented;
- registry entries and show cd; may be cluttered with a large number of undocumented s-attributes if multiple users have write access to the corpus;
- CQP has to be restarted in order to recognise the newly indexed s-attributes;
- a query result may contain overlapping and even nested spans,²⁸ which are not allowed in s-attributes; care has to be taken to discard the problematic matches beforehand.
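The last point can be illustrated with a minimal Python sketch. It assumes that the dumped query result is available as a list of (match, matchend) corpus positions and applies one possible policy: for nested or overlapping spans, keep the one that starts first (and the longer one if two spans start at the same position). The function name `non_overlapping` and the example positions are hypothetical; other policies are equally valid.

```python
def non_overlapping(spans):
    """Return a subset of (match, matchend) spans without nesting or overlap.

    Policy (an assumption, not prescribed by CWB): earlier start wins;
    for equal start positions the longer span wins.
    """
    kept = []
    last_end = -1  # matchend of the most recently kept span
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if start > last_end:  # span begins after the previous kept span ends
            kept.append((start, end))
            last_end = end
    return kept


# hypothetical dump: (12, 12) is nested in (10, 14), (13, 16) overlaps it
spans = [(10, 14), (12, 12), (13, 16), (20, 21), (20, 20)]
print(non_overlapping(spans))
```

Only the filtered list would be suitable as input for cwb-s-encode; the discarded spans are lost, which is one motivation for the alternative described next.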
Region elements provide a convenient and powerful alternative: <<NP>> interprets the named query result NP as structural annotation and matches any of its match...matchend spans.²⁹ We can thus immediately search for “NP about NP” with the query
> <<NP>> "about" <<NP>>;
Named query results thus provide flexible ad-hoc annotation. They can be used (and updated) directly as temporary annotation in a running CQP session, but can also be persisted to disk (save NP;) and will then be loaded on demand in a new session. Ad-hoc annotation does not need special file access permissions and is private to each user, but can also easily be shared with other users (by copying the saved query results).
If naming conventions are observed (lowercase for s-attribute names, CamelCase for named query results), there can be no conflict between the two uses of region elements. Otherwise, a query result always takes precedence, since it is the main application of the new feature.
To search for spans between other anchors in the query result, make a copy of the NQR and use set to modify the matching span. Assuming that (for instance) a target anchor has been set on the first noun in NP,³⁰ we would type
> PreNominal = NP;
> set PreNominal matchend target[-1] !;
in order to be able to match the prenominal elements of a noun phrase with <<PreNominal>>. The set ... !; operation automatically discards any matches consisting only of nouns, because then target[-1] precedes the match anchor (resulting in an invalid span).³¹
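The effect of this set operation on the spans can be modelled with a short Python sketch. It is only an illustration of the behaviour described above, not CQP's implementation; the rows are hypothetical (match, matchend, target) corpus positions and the function name `prenominal_spans` is made up for this example.

```python
def prenominal_spans(rows):
    """Model of 'set PreNominal matchend target[-1] !;'.

    rows: list of (match, matchend, target) corpus positions.
    Returns the modified (match, matchend) spans; rows whose new
    matchend would precede match are dropped as invalid.
    """
    result = []
    for match, _matchend, target in rows:
        new_end = target - 1  # token immediately before the first noun
        if new_end >= match:  # otherwise the span is invalid and discarded
            result.append((match, new_end))
    return result


rows = [
    (5, 8, 7),     # two prenominal tokens, first noun at position 7
    (20, 21, 20),  # NP consists only of nouns: target coincides with match
    (30, 32, 31),  # one prenominal token
]
print(prenominal_spans(rows))
```

The second row is discarded because its new matchend (19) would lie before its match anchor (20).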
Of course, region elements can be repeated with quantifiers, included in a complex subexpression, etc. As a simple example, consider the query
> <<NP>> ( [pos = "IN|TO"] <<NP>> ){5};
Let us now consider a query result NamedEntity that contains tentatively identified named entities in the corpus:
```
> NamedEntity = (?traditional) 
                <np> []* [pos = "NPS?"] </np> | <np1> []* [pos = "NPS?"] </np1>;
```
where the traditional matching strategy enables nested matches for complex named entities such as the [Lord Mayor of [London]]. We can then find a preposition followed by a tentative named entity:
> (?traditional) [pos = "IN|TO"] <<NamedEntity>>;
taking all nested and/or overlapping NE candidates into account, which would not be possible with an s-attribute. Again, the traditional matching strategy enables nested matches (e.g. look for the line in the presence of the Ghost of Christmas).
Such named entity candidates are often generated by an external annotation tool and would traditionally be encoded as an s-attribute. In order to allow for overlapping and nested NEs, they can be undumped into a NQR instead. If there is an additional categorisation of NEs, multiple NQRs (NEperson, NEloc, NEorg, ...) can be created for the different classes.
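As a rough sketch of this workflow, the Python function below turns a list of externally generated NE spans into a table for undump. It assumes the tabular format with a header line giving the number of rows, followed by one tab-separated match/matchend pair per line; the function name `undump_table` and the corpus positions are hypothetical.

```python
def undump_table(spans):
    """Format (match, matchend) spans as text for undump.

    Assumed layout: first line = number of rows, then one
    'match<TAB>matchend' line per span, sorted by corpus position.
    """
    rows = sorted(spans)
    lines = [str(len(rows))]
    lines += [f"{match}\t{matchend}" for match, matchend in rows]
    return "\n".join(lines) + "\n"


# hypothetical positions for the nested NEs [Lord Mayor of [London]]
table = undump_table([(103, 103), (100, 103)])
print(table, end="")
```

The resulting text could be written to a file and then loaded in CQP with something like undump NamedEntity < "ne.tbl"; (the file name is an assumption). Nested spans such as those in the example need no special treatment here.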
Labels and target markers can be set on the first and last token of the range matched by a region element, using the same notation as for token expressions. Those for the first token are inserted immediately after the opening <<, those for the last token immediately before the closing >>. It is recommended (and sometimes necessary) to separate them from the element name with whitespace in order to avoid parsing errors.
The example below uses labels to ensure that the second NP is at least 3 tokens long (i.e. its first and last token are at least 2 positions apart), and it sets target markers on the end of the first NP and the start of the second (with the other boundaries implicitly given by the match and matchend anchors).
> <<NP @0>> "after" <<@1 a: NP b:>> :: distabs(a, b) >= 2;
Zero-width region elements are indicated by a slash after the element name, e.g. <<NP/>>. They match at the start of a suitable span, but do not consume any tokens. Because there might be multiple spans at the same position, a label or target marker can only be set on the first token of the span. Such zero-width region elements are of little use for s-attributes: <<np/>> behaves almost exactly like the XML start tag <np>, except that it allows a label to be set and does not match up with a corresponding end tag </np>.
The main purpose of zero-width region elements is to enable anchored queries, which take their potential starting point from a previous query result. This is similar to the earlier strategy of running a subquery starting with a <match> anchor (see Sec. 6.3), but much more convenient; moreover it allows for overlapping matches.
A query with a highly selective element near the end such as
> "the"%c [pos="JJ.*"]{3,} [lemma="creature"];
can now conveniently be sped up³² by a pre-filtering strategy:
> Cand = MU(meet "the"%c [lemma="creature"] 1 5);
> <<Cand/>> "the"%c [pos="JJ.*"]{3,} [lemma="creature"];
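The idea behind this pre-filtering strategy can be sketched in Python on a toy token list. This is a simplified model under stated assumptions, not a description of how CQP evaluates the queries: `candidates` mimics the MU query (a "the" with lemma creature within 1 to 5 tokens to its right), and `match_at` mimics the token pattern at a given start position. All function names and the example sentence are hypothetical.

```python
def tok(word, pos, lemma=None):
    return {"word": word, "pos": pos, "lemma": lemma or word.lower()}


def match_at(tokens, i):
    """End position of "the"%c [pos="JJ.*"]{3,} [lemma="creature"] at i, else None."""
    if i >= len(tokens) or tokens[i]["word"].lower() != "the":
        return None
    j = i + 1
    while j < len(tokens) and tokens[j]["pos"].startswith("JJ"):
        j += 1
        n_adj = j - (i + 1)  # adjectives consumed so far
        if n_adj >= 3 and j < len(tokens) and tokens[j]["lemma"] == "creature":
            return j
    return None


def candidates(tokens, lo=1, hi=5):
    """Positions of "the" with lemma 'creature' lo..hi tokens to the right."""
    out = []
    for i, t in enumerate(tokens):
        if t["word"].lower() == "the":
            window = tokens[i + lo : i + hi + 1]
            if any(w["lemma"] == "creature" for w in window):
                out.append(i)
    return out


tokens = [
    tok("The", "DT"), tok("poor", "JJ"), tok("little", "JJ"),
    tok("lonely", "JJ"), tok("creature", "NN"), tok("saw", "VBD", "see"),
    tok("the", "DT"), tok("old", "JJ"), tok("man", "NN"), tok("and", "CC"),
    tok("the", "DT"), tok("creature", "NN"), tok(".", "."),
]

# full scan: try every corpus position as a potential start
full = [(i, match_at(tokens, i)) for i in range(len(tokens))]
full = [(i, end) for i, end in full if end is not None]

# anchored scan: only try the pre-filtered candidate positions
cand = candidates(tokens)
anchored = [(i, match_at(tokens, i)) for i in cand]
anchored = [(i, end) for i, end in anchored if end is not None]

print(cand, anchored)
```

In this toy example the anchored scan attempts a match at 3 positions instead of 13 and finds the same result. Note that in this sketch the window of 1 to 5 tokens means that a "the" followed by more than four adjectives is not a candidate, so the window has to be chosen wide enough for the matches one expects.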