- new in v3.4.12: tabular (TAB) queries are an obscure
and formerly undocumented feature of CQP. They were dysfunctional for a long
time, but have now been resurrected. The implementation was originally
considered experimental, but is now considered stable with full long-term
support for CWB 3.5.
- A TAB query starts with the keyword TAB and matches a sequence
of one or more token patterns with optional flexible gaps. In its simplest
form, it corresponds to a standard query matching a fixed sequence of tokens,
but is often executed faster. Compare the standard query
> "in" "due" "course";
with the much more efficient TAB query
> TAB "in" "due" "course";
This query is both simpler and faster than the MU version
given in Sec. 8.4.
- The most substantial performance gains are achieved for sequences
that start with very frequent items and end in a selective token pattern, e.g.
> TAB [pos = "DT"] [pos = "JJ.*"] "tea";
TAB queries cannot be used as a general optimization for standard queries,
though, because the individual elements cannot be made optional or
repeated with a quantifier. It is also not possible to specify alternatives
(|) within a TAB query (but you can run multiple queries and take
the union of the results).
- The main purpose of tabular queries is to match sequences with flexible
gaps. The following two-word TAB query finds cats followed by dogs
with a gap of up to two intervening tokens:
> TAB "cats" {0,2} "dogs";
It is equivalent to the standard query
> "cats" []{0,2} "dogs";
but keep in mind that TAB "cats" []{0,2} "dogs";
would mean something entirely different!
- Gaps can be specified using any of the repetition operators familiar from standard queries:
| op. | gap size |
|---|---|
| ? | 0 or 1 token |
| * | 0 or more tokens |
| + | 1 or more tokens |
| {n} | exactly n tokens |
| {n,k} | between n and k tokens |
| {n,} | at least n tokens |
| {,k} | up to k tokens (same as {0,k}) |
All gap specifications behave as if the repetition operator had been applied
to a matchall ([]) in a standard query.
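The gap operators can be read as (minimum, maximum) ranges of intervening tokens. The following Python sketch is only an illustration of that reading and is not part of CQP; the function name `gap_range` and the use of `None` for an unbounded maximum are assumptions made for this example.

```python
import re
from typing import Optional, Tuple


def gap_range(op: str) -> Tuple[int, Optional[int]]:
    """Translate a gap operator into (min, max) intervening tokens.

    A max of None stands for "no upper limit" (hypothetical convention).
    """
    simple = {"?": (0, 1), "*": (0, None), "+": (1, None)}
    if op in simple:
        return simple[op]
    m = re.fullmatch(r"\{(\d*)(,?)(\d*)\}", op)
    if not m or (m.group(1) == "" and m.group(3) == ""):
        raise ValueError(f"not a gap operator: {op!r}")
    lo_s, comma, hi_s = m.groups()
    if not comma:
        # {n}: exactly n tokens
        return (int(lo_s), int(lo_s))
    # {n,k}, {n,} and {,k}: a missing bound defaults to 0 or "unbounded"
    return (int(lo_s) if lo_s else 0, int(hi_s) if hi_s else None)


print(gap_range("{,5}"))  # (0, 5), the same range as {0,5}
```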
- TAB queries can additionally be restricted by a within clause.
For example, the query
> TAB "girl" {2} "girl";
finds a repetition of the noun girl after exactly two intervening tokens,
but many of the matches cross a sentence boundary.
In order to discard these matches, change the query to
> TAB "girl" {2} "girl" within s;
- TAB queries do not support different matching strategies, but always use
an early match principle similar to the default setting of standard CQP queries
(regardless of the value of the MatchingStrategy option).
For example, the query
> TAB [pos = "JJ"] {,5} [pos = "NN"];
will only match small and very old train in the phrase
a small and very old train station.
It cannot be configured to return the shortest (old train)
or longest (small and very old train station) match.
- TAB queries always return the full range of tokens containing
the specified items. Individual items cannot be marked in any way
(i.e. neither as target pattern nor with labels), due to limitations
of the current CQP implementation.
- When composing more complex TAB queries, it is important to understand
how their “greedy search” approach works, since its results may be different from the corresponding
standard CQP queries:
  - for every possible start position, i.e. each match of the first token pattern
    - scan for a match of the second token pattern within the specified range
    - greedily fix the first such match that is encountered
      (i.e. the search algorithm commits to this corpus position being part of the full match)
    - starting from this position, scan for a match of the third token pattern
    - greedily fix the first such match that is encountered
    - etc.
If a complete match is found, CQP continues with the next possible start position,
so there can be at most one match for each start position.
In addition, nested matches are discarded as in standard CQP queries
(hence old train above is actually matched by the algorithm,
but then discarded as a nested match).
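The greedy scan described above can be sketched in Python. This is a simplified simulation for illustration only, not the CQP implementation: tokens are hypothetical (word, pos) pairs, token patterns are plain predicates, gaps are (min, max) pairs with `None` for an unbounded maximum, and the within clause is not modelled. The nested-match filter is likewise a simplifying assumption.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]              # (word, pos): hypothetical toy representation
Pattern = Callable[[Token], bool]    # stands in for a CQP token pattern
Gap = Tuple[int, Optional[int]]      # (min, max) intervening tokens; None = unbounded


def greedy_tab(tokens: List[Token], patterns: List[Pattern],
               gaps: List[Gap]) -> List[Tuple[int, int]]:
    """Return (start, end) ranges found by a greedy left-to-right scan."""
    assert len(gaps) == len(patterns) - 1
    matches = []
    for start, tok in enumerate(tokens):
        if not patterns[0](tok):
            continue
        pos = start
        for pattern, (lo, hi) in zip(patterns[1:], gaps):
            first = pos + 1 + lo
            last = len(tokens) - 1 if hi is None else min(pos + 1 + hi, len(tokens) - 1)
            # Commit to the first candidate in range; there is no backtracking.
            pos = next((i for i in range(first, last + 1) if pattern(tokens[i])), None)
            if pos is None:
                break
        else:
            matches.append((start, pos))  # at most one match per start position
    # Discard matches that are nested inside another match.
    return [m for m in matches
            if not any(o != m and o[0] <= m[0] and m[1] <= o[1] for o in matches)]


phrase = [("a", "DT"), ("small", "JJ"), ("and", "CC"), ("very", "RB"),
          ("old", "JJ"), ("train", "NN"), ("station", "NN")]
is_jj = lambda t: t[1] == "JJ"
is_nn = lambda t: t[1] == "NN"

# Mirrors: TAB [pos = "JJ"] {,5} [pos = "NN"];
result = greedy_tab(phrase, [is_jj, is_nn], [(0, 5)])
for s, e in result:
    print(" ".join(w for w, _ in phrase[s:e + 1]))  # small and very old train
```

In this sketch, old train is found for the second start position and then removed by the nested-match filter, which corresponds to the behaviour described above.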
- Always keep in mind that CQP does not perform an expensive combinatorial
search to consider other matches of token patterns that might also fall
within the specified ranges! If a greedily selected match item does not lead
to a complete match, but a later item for the same token pattern would have,
the correct solution will not be found.
- As a concrete example, consider the sentence
Fortunately , we had time_A
for_B1 delicious pastries and
for_B2 coffee_C .
The TAB query
> TAB "time" * "for" ? "coffee" within s;
would not match this sentence because "for" is greedily fixed
to the first available token B1, for which C is not in range.
The corresponding standard query
> "time" []* @"for" []? "coffee" within s;
considers both options, on the other hand, and matches the range
A...C, with the target anchor (@) set to B2.
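The difference between the two strategies can be reproduced with a small Python sketch. It is an illustration under simplifying assumptions, not CQP code: the sentence is a plain list of word forms standing for a single s region, token patterns are literal words, and the hypothetical helper `find_end` either commits to the first candidate (greedy, as in TAB queries) or tries every candidate in range (a stand-in for the standard query).

```python
from typing import List, Optional, Tuple

Gap = Tuple[int, Optional[int]]  # (min, max) intervening tokens; None = unbounded


def candidates(words: List[str], word: str, pos: int, gap: Gap) -> List[int]:
    """Positions after `pos` where `word` occurs within the gap range."""
    lo, hi = gap
    first = pos + 1 + lo
    last = len(words) - 1 if hi is None else min(pos + 1 + hi, len(words) - 1)
    return [i for i in range(first, last + 1) if words[i] == word]


def find_end(words: List[str], items: List[str], gaps: List[Gap],
             pos: int, greedy: bool) -> Optional[int]:
    """Return the end position of a match continuing after `pos`, or None."""
    if not items:
        return pos
    cands = candidates(words, items[0], pos, gaps[0])
    if greedy:
        cands = cands[:1]  # commit to the first candidate only
    for i in cands:
        end = find_end(words, items[1:], gaps[1:], i, greedy)
        if end is not None:
            return end
    return None


sentence = "Fortunately , we had time for delicious pastries and for coffee .".split()
start = sentence.index("time")       # position A
gaps = [(0, None), (0, 1)]           # the gaps * and ?

greedy_end = find_end(sentence, ["for", "coffee"], gaps, start, greedy=True)
full_end = find_end(sentence, ["for", "coffee"], gaps, start, greedy=False)
print(greedy_end)  # None: "for" is fixed at B1 and "coffee" is out of range
print(full_end)    # 10: the position of "coffee" (C), reached via B2
```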
- There are two special cases in which TAB queries are guaranteed to
find every early match that satisfies the gap specifications:
- All gaps have a fixed size ({n}),
which can be different for each gap. This includes in particular
the case where the token patterns are directly adjacent.
> TAB "Mr" {1} "Mrs" [pos = "N.*"];
> TAB "in" "due" "course";
- All gaps are specified as * and the search range is only restricted
by a within clause. Note that * and fixed-size gaps
(even direct adjacency) must not be mixed in this case.
> TAB "one" * "two" * "three" within s;
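As a rough aid, the two special cases can be expressed as a check on the list of gaps. The Python sketch below is an assumption-laden illustration rather than anything provided by CQP: gaps are (min, max) pairs with `None` for an unbounded maximum, direct adjacency is written as (0, 0), and the hypothetical function `guaranteed_case` does not verify that a within clause is present in the second case.

```python
from typing import List, Optional, Tuple

Gap = Tuple[int, Optional[int]]  # (min, max) intervening tokens; None = unbounded


def guaranteed_case(gaps: List[Gap]) -> Optional[str]:
    """Name the special case a gap list falls into, or None if it is neither."""
    if all(lo == hi for lo, hi in gaps):
        return "fixed"   # every gap has a fixed size, including adjacency (0, 0)
    if all(gap == (0, None) for gap in gaps):
        return "star"    # every gap is *; a within clause is assumed to be present
    return None          # mixed or variable-size gaps: the greedy search may miss matches


print(guaranteed_case([(1, 1), (0, 0)]))        # fixed: TAB "Mr" {1} "Mrs" [pos = "N.*"];
print(guaranteed_case([(0, None), (0, None)]))  # star:  TAB "one" * "two" * "three" within s;
print(guaranteed_case([(0, None), (0, 1)]))     # None:  TAB "time" * "for" ? "coffee" within s;
```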