8.5 TAB queries

New in v3.4.12: Tabular (TAB) queries are an obscure and formerly undocumented feature of CQP. They were dysfunctional for a long time, but have now been resurrected. The implementation was originally considered experimental, but is now considered stable, with full long-term support for CWB 3.5.
A TAB query starts with the keyword TAB and matches a sequence of one or more token patterns with optional flexible gaps. In its simplest form, it corresponds to a standard query matching a fixed sequence of tokens, but is often executed faster. Compare the standard query
> "in" "due" "course";
with the much more efficient TAB query
> TAB "in" "due" "course";
This query is both simpler and faster than the MU version given in Sec. 8.4.
The most substantial performance gains are achieved for sequences that start with very frequent items and end in a selective token pattern, e.g.
> TAB [pos = "DT"] [pos = "JJ.*"] "tea";
TAB queries cannot be used as a general optimization for standard queries, though, because the individual elements cannot be made optional or repeated with a quantifier. It is also not possible to specify alternatives (|) within a TAB query (but you can run multiple queries and take the union of the results).
The main purpose of tabular queries is to match sequences with flexible gaps. The following two-word TAB query finds cats followed by dogs with a gap of up to two intervening tokens:
> TAB "cats" {0,2} "dogs";
It is equivalent to the standard query
> "cats" []{0,2} "dogs";
but keep in mind that TAB "cats" []{0,2} "dogs"; would mean something entirely different!²¹
Gaps can be specified using any of the repetition operators familiar from standard queries:

op.      gap size
?        0 or 1 token
*        0 or more tokens
+        1 or more tokens
{n}      exactly n tokens
{n,k}    between n and k tokens
{n,}     at least n tokens
{,k}     up to k tokens (same as {0,k})

All gap specifications behave as if the repetition operator had been applied to a matchall ([]) in a standard query.
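To make the table concrete, the following Python sketch translates a gap specification into the minimum and maximum number of intervening tokens. It is only an illustration and not part of CQP: the function name gap_range is hypothetical, and treating an empty string as direct adjacency is an assumption made for this example.

```python
import re
from math import inf

def gap_range(spec: str) -> tuple[int, float]:
    """Translate a TAB gap specification into (min, max) intervening tokens.

    Hypothetical helper for illustration only; "" stands for direct adjacency.
    """
    simple = {"": (0, 0), "?": (0, 1), "*": (0, inf), "+": (1, inf)}
    if spec in simple:
        return simple[spec]
    m = re.fullmatch(r"\{(\d*)(,?)(\d*)\}", spec)
    if not m or not (m.group(1) or m.group(3)):
        raise ValueError(f"not a gap specification: {spec!r}")
    lo, comma, hi = m.groups()
    lower = int(lo) if lo else 0          # {,k} starts at 0
    if not comma:                         # {n}: fixed gap size
        return (lower, lower)
    upper = int(hi) if hi else inf        # {n,}: no upper limit
    return (lower, upper)

print(gap_range("{0,2}"))   # (0, 2)
print(gap_range("{,5}"))    # (0, 5)
```

The unbounded cases return math.inf as the upper limit, which mirrors the fact that such a gap is in practice limited only by the end of the corpus or by a within clause.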
TAB queries can additionally be restricted by a within clause. For example, the query
> TAB "girl" {2} "girl";
finds a repetition of the noun girl after exactly two intervening tokens, but many of the matches cross a sentence boundary. In order to discard these matches, change the query to
> TAB "girl" {2} "girl" within s;
TAB queries do not support different matching strategies, but always use an early match principle similar to the default setting of standard CQP queries (regardless of the value of the MatchingStrategy option). For example, the query
> TAB [pos = "JJ"] {,5} [pos = "NN"];
will only match the part small and very old train of the phrase a small and very old train station. It cannot be configured to return the shortest (old train) or longest (small and very old train station) match.
TAB queries always return the full range of tokens containing the specified items. Individual items cannot be marked in any way (i.e. neither as target pattern nor with labels), due to limitations of the current CQP implementation.
When composing more complex TAB queries, it is important to understand how their “greedy search”²² approach works, since its results may differ from those of the corresponding standard CQP queries. The search proceeds as follows:
- for every possible start position, i.e. each match of the first token pattern
- scan for a match of the second token pattern within the specified range
- greedily fix the first such match that is encountered
  (i.e. the search algorithm commits to this corpus position being part of the full match)
- starting from this position, scan for a match of the third token pattern
- greedily fix the first such match that is encountered
- etc.
If a complete match is found, CQP continues with the next possible start position, so there can be at most one match for each start position. In addition, nested matches are discarded as in standard CQP queries (hence old train above is actually matched by the algorithm, but then discarded as a nested match).
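The steps above can be sketched in Python as follows. This is a simplified model for illustration, not the actual CQP implementation: the function tab_search, the representation of tokens as (word, pos) pairs and of gaps as (min, max) ranges are all assumptions. Within clauses are not modelled, and the nested-match filter assumes that a match is dropped whenever it lies inside another match.

```python
from math import inf

def tab_search(tokens, patterns, gaps):
    """Greedy TAB-style search (illustrative model only).

    tokens:   list of token objects
    patterns: list of predicates token -> bool, one per token pattern
    gaps:     list of (min, max) gap sizes between consecutive patterns
    Returns inclusive (start, end) positions of the surviving matches.
    """
    matches = []
    for start, tok in enumerate(tokens):
        if not patterns[0](tok):
            continue                      # not a possible start position
        pos = start
        for pred, (lo, hi) in zip(patterns[1:], gaps):
            first = pos + 1 + lo
            last = min(pos + 1 + hi, len(tokens) - 1)
            # greedily commit to the first match found within the range
            pos = next((i for i in range(first, last + 1)
                        if pred(tokens[i])), None)
            if pos is None:
                break                     # no alternatives are considered
        else:
            matches.append((start, pos))  # at most one match per start
    # discard matches nested inside another match
    return [m for m in matches
            if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                       for o in matches)]

phrase = [("a", "DT"), ("small", "JJ"), ("and", "CC"), ("very", "RB"),
          ("old", "JJ"), ("train", "NN"), ("station", "NN")]
is_jj = lambda t: t[1] == "JJ"
is_nn = lambda t: t[1] == "NN"

# models: TAB [pos = "JJ"] {,5} [pos = "NN"];
for s, e in tab_search(phrase, [is_jj, is_nn], [(0, 5)]):
    print(" ".join(word for word, _ in phrase[s:e + 1]))
# prints: small and very old train
```

In this model, old train is found as a second match starting at old, but it is removed by the final filter because it lies inside the longer match.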
Always keep in mind that CQP does not perform an expensive combinatorial search to consider other matches of token patterns that might also fall within the specified ranges! If a greedily selected match item does not lead to a complete match, but a later item for the same token pattern would have, the correct solution will not be found.
As a concrete example, consider the sentence
Fortunately , we had time_A for_B1 delicious pastries and for_B2 coffee_C .

The TAB query
> TAB "time" * "for" ? "coffee" within s;
would not match this sentence because "for" is greedily fixed to the first available token B1, for which C is not in range. The corresponding standard query
> "time" []* @"for" []? "coffee" within s;
on the other hand, considers both options and matches the range A...C, with the target anchor (@) set to B2.
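The difference can be reproduced with a small Python model of the example sentence. This is a sketch under stated assumptions: the functions find_from and search are hypothetical, items are compared as plain word forms, and the backtracking variant is merely a stand-in for the behaviour of the standard query, not a description of how CQP evaluates it. The within s restriction is implied by searching a single sentence.

```python
from math import inf

def find_from(words, items, gaps, pos, backtrack):
    """Positions of the remaining items after position pos, or None."""
    if not items:
        return []
    lo, hi = gaps[0]
    last = min(pos + 1 + hi, len(words) - 1)
    for i in range(pos + 1 + lo, last + 1):
        if words[i] != items[0]:
            continue
        rest = find_from(words, items[1:], gaps[1:], i, backtrack)
        if rest is not None:
            return [i] + rest
        if not backtrack:
            return None   # greedy: committed to the first candidate
    return None

def search(words, items, gaps, backtrack=False):
    """Return one list of item positions per successful start position."""
    results = []
    for start, word in enumerate(words):
        if word == items[0]:
            rest = find_from(words, items[1:], gaps, start, backtrack)
            if rest is not None:
                results.append([start] + rest)
    return results

sentence = ("Fortunately , we had time for delicious pastries "
            "and for coffee .").split()
items = ["time", "for", "coffee"]
gaps = [(0, inf), (0, 1)]          # models: "time" * "for" ? "coffee"

print(search(sentence, items, gaps))                  # []
print(search(sentence, items, gaps, backtrack=True))  # [[4, 9, 10]]
```

The greedy run commits to the first for (position 5, i.e. B1) and gives up when coffee is not within one token of it. The backtracking run goes on to try the second for (position 9, i.e. B2) and succeeds.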
There are two special cases in which TAB queries are guaranteed to find every early match that satisfies the gap specifications:
1. All gaps have a fixed size ({n}), which can be different for each gap. This includes in particular the case where the token patterns are directly adjacent.
  > TAB "Mr" {1} "Mrs" [pos = "N.*"];
  > TAB "in" "due" "course";
2. All gaps are specified as * and the search range is only restricted by a within clause. Note that * and fixed-size gaps (even direct adjacency) must not be mixed in this case.
  > TAB "one" * "two" * "three" within s;
