8.4 MU queries

CQP offers search-engine-like “Boolean” queries in a special meet-union (MU) notation. This feature goes back to the original developer of CWB, but was not supported officially before CWB v3.4.12. In particular, there was no precise specification of the semantics of MU queries, and the original implementation did not produce consistent results.
new in v3.4.12: Recently, MU queries have found more widespread use as proximity queries in the CEQL “simple query” syntax of BNCweb and CQPweb, giving them a semi-official status. For this reason, the implementation was modified to ensure a consistent and well-defined behaviour, although it may not always correspond to what is desired intuitively. The new MU implementation is documented here.
Warning: both the syntax and the semantics of MU queries are subject to fundamental revisions in the next major release of CWB (version 4.0), but are considered stable with long-term support for CWB 3.5.
A meet-union query consists of nested meet and union operations forming a binary-branching tree that is written in LISP-like prefix notation. MU queries always start with the keyword MU and are completely separate from the standard CQP syntax, sharing only the system by which individual token patterns are specified.
The simplest form of a MU query specifies a single token pattern, which may also be given in shorthand notation if the default p-attribute is to be matched. These queries are fully equivalent to the corresponding standard queries (which would be the same, but without the leading MU).
> MU [lemma = "light" & pos = "V.*"];
> MU "lights" %c;
A meet clause matches two token patterns within a specified distance of each other. More precisely, instances of the first pattern are filtered, keeping only those where the second pattern occurs within the specified window. For example, the following query finds nouns that co-occur with the adjective lovely:
> MU(meet [pos = "NN.*"] [lemma = "lovely"] -2 2);
This query returns all nouns for which lovely occurs within two tokens to the left (window starting at offset -2) or right (window ending at offset +2). The adjective lovely is not included in the match, nor marked in any other way.
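The filtering behaviour described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not CQP's actual implementation: the function name meet, the toy token list and the use of the word form as a stand-in for the lemma attribute are all assumptions made for illustration.

```python
# Toy model of a meet clause with a numeric window (not CQP's implementation).
def meet(first, second, left, right):
    """Keep each position p from `first` for which some position from
    `second` lies in the window p + left ... p + right."""
    second = set(second)
    return sorted(p for p in set(first)
                  if any(p + d in second for d in range(left, right + 1)))

# Hypothetical tagged tokens (word, pos); the word form stands in for the lemma.
tokens = [("a", "DT"), ("lovely", "JJ"), ("old", "JJ"), ("garden", "NN"),
          ("with", "IN"), ("roses", "NNS"), ("and", "CC"), ("trees", "NNS")]

nouns = [i for i, (_, tag) in enumerate(tokens) if tag.startswith("NN")]
lovely = [i for i, (word, _) in enumerate(tokens) if word == "lovely"]

# Corresponds to: MU(meet [pos = "NN.*"] [lemma = "lovely"] -2 2);
result = meet(nouns, lovely, -2, 2)
print(result)  # [3]
```

Only the position of garden is returned: lovely lies two tokens to its left, whereas roses and trees have no lovely inside their windows. The position of lovely itself does not appear in the result.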
In order to match only prenominal adjectives, we can change the window to include only the three tokens preceding the noun (i.e. offsets -3 ...-1):
> MU(meet [pos = "NN.*"] [lemma = "lovely"] -3 -1);
Since a meet clause returns only occurrences of the first token pattern, we need to change the ordering in order to focus on the adjective rather than the nouns. Don't forget to adjust the window offsets accordingly!
> MU(meet [lemma = "lovely"] [pos = "NN.*"] 1 3);
Note that meet operations are not symmetric: this query returns fewer matches than the previous one. The difference arises in those cases where multiple nouns occur near the same instance of lovely, because each instance of lovely is returned only once.
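The asymmetry is easy to reproduce with a small Python model of the meet operation. The helper meet and the four-token example are hypothetical and serve only to illustrate the counting argument; they say nothing about how CQP evaluates the query internally.

```python
# Toy model of a meet clause with a numeric window (not CQP's implementation).
def meet(first, second, left, right):
    second = set(second)
    return sorted(p for p in set(first)
                  if any(p + d in second for d in range(left, right + 1)))

# Hypothetical example: one instance of "lovely" followed by two nouns.
tokens = [("lovely", "JJ"), ("house", "NN"), ("and", "CC"), ("garden", "NN")]
nouns = [i for i, (_, tag) in enumerate(tokens) if tag.startswith("NN")]
lovely = [i for i, (word, _) in enumerate(tokens) if word == "lovely"]

noun_focus = meet(nouns, lovely, -3, -1)  # focus on the nouns
adj_focus = meet(lovely, nouns, 1, 3)     # focus on the adjective
print(noun_focus)  # [1, 3]
print(adj_focus)   # [0]
```

Both nouns have lovely within the three preceding tokens, so the noun-focused query yields two matches, while the adjective-focused query yields a single match for the one instance of lovely.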
Alternatively, we can search for co-occurrence within sentences or other s-attribute regions. Again, the ordering of the token constraints determines whether we focus on tea or cakes:
> MU(meet "tea"%c "cakes"%c s);
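Co-occurrence within a region can be modelled in the same spirit. The following Python sketch assumes that sentences are given as inclusive (start, end) ranges of corpus positions and that a position outside every region never matches; the function name meet_within and the example text are made up for illustration.

```python
# Toy model of a meet clause restricted to s-attribute regions
# (a sketch of the semantics only, not CQP's implementation).
def meet_within(first, second, regions):
    def region_of(p):
        for idx, (start, end) in enumerate(regions):
            if start <= p <= end:
                return idx
        return None  # position lies outside all regions

    second_regions = {region_of(q) for q in second} - {None}
    return sorted(p for p in set(first) if region_of(p) in second_regions)

words = "We had tea and cakes . Tea was cold . Cakes were fine .".split()
sentences = [(0, 5), (6, 9), (10, 13)]  # hypothetical <s> regions

# Case-insensitive matching, as with the %c flag
tea = [i for i, w in enumerate(words) if w.lower() == "tea"]
cakes = [i for i, w in enumerate(words) if w.lower() == "cakes"]

focus_tea = meet_within(tea, cakes, sentences)
focus_cakes = meet_within(cakes, tea, sentences)
print(focus_tea)    # [2]
print(focus_cakes)  # [4]
```

Only the first sentence contains both words, so each ordering returns a single position: that of tea in one case and that of cakes in the other.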
A union clause simply combines the matches of two token patterns into a set union, corresponding to a disjunction (logical or) of the constraints. The following three queries are fully equivalent:
> MU(union "tea"%c "coffee"%c);
> "tea"%c | "coffee"%c;
> [(word = "tea" %c) | (word = "coffee" %c)];
MU queries are relatively powerful because the two elements of a meet or union clause can themselves be complex clauses. For example, the trigram in due course can be found by nesting two meet conditions:
> MU(meet (meet "in" "due" 1 1) "course" 2 2);
The inner clause returns all instances of in that are immediately followed by due; the outer clause requires that the following token (the token at an offset of +2 from in) must be course. We can obtain exactly the same result with this query:
> MU(meet "in" (meet "due" "course" 1 1) 1 1);
Now the inner clause determines all occurrences of the bigram due course, but returns only the corpus positions of due, which must appear immediately after in.
Can you find two other MU formulations that produce exactly the same results?
Keep in mind that the final result includes only the corpus positions of the leftmost-specified token pattern. If you want to find instances of course in this multiword expression, rewrite the query as
> MU(meet (meet "course" "due" -1 -1) "in" -2 -2);
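The nested queries for in due course can be checked against a toy model of the meet operation. The Python sketch below uses a hypothetical helper meet and an invented token sequence; it merely traces the semantics described in the text for the three formulations shown above.

```python
# Toy model of a meet clause with a numeric window (not CQP's implementation).
def meet(first, second, left, right):
    second = set(second)
    return sorted(p for p in set(first)
                  if any(p + d in second for d in range(left, right + 1)))

words = "in due course , in due time , of course".split()

def pos_of(form):
    """Corpus positions of a word form in the toy corpus."""
    return [i for i, w in enumerate(words) if w == form]

IN, DUE, COURSE = pos_of("in"), pos_of("due"), pos_of("course")

# MU(meet (meet "in" "due" 1 1) "course" 2 2);
left_nested = meet(meet(IN, DUE, 1, 1), COURSE, 2, 2)
# MU(meet "in" (meet "due" "course" 1 1) 1 1);
right_nested = meet(IN, meet(DUE, COURSE, 1, 1), 1, 1)
# MU(meet (meet "course" "due" -1 -1) "in" -2 -2);
course_focus = meet(meet(COURSE, DUE, -1, -1), IN, -2, -2)

print(left_nested, right_nested, course_focus)  # [0] [0] [2]
```

The first two formulations return the position of in for the single complete trigram, while the third returns the position of course in the same trigram. The incomplete sequences in due time and of course are filtered out in all three cases.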
new in v3.4.30: meet clauses in MU queries can now also be negated. When the not operator is given, CQP discards (rather than exclusively retaining) all matches for the first subclause that co-occur with a match of the second subclause. The following query finds all instances of ground that are not preceded by an article within a span of 3 tokens:
> MU(meet "ground" not [pos = "DT"] -3 -1);
Such negated meet clauses work for all context specifications and can be used in arbitrarily nested MU queries, allowing for even more complex co-occurrence filters.
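A negated meet clause can be modelled as the complement of the ordinary filter. In the following Python sketch, the function name meet_not and the tagged example sentence are assumptions for illustration only; the logic follows the description above (discard instead of retain).

```python
# Toy model of a negated meet clause with a numeric window
# (a sketch of the semantics only, not CQP's implementation).
def meet_not(first, second, left, right):
    """Discard each position p from `first` for which some position from
    `second` lies in the window p + left ... p + right; keep the rest."""
    second = set(second)
    return sorted(p for p in set(first)
                  if not any(p + d in second for d in range(left, right + 1)))

# Hypothetical tagged tokens (word, pos)
tokens = [("the", "DT"), ("ground", "NN"), ("was", "VBD"), ("wet", "JJ"),
          ("but", "CC"), ("ground", "NN"), ("level", "NN"), ("rose", "VBD")]

ground = [i for i, (word, _) in enumerate(tokens) if word == "ground"]
articles = [i for i, (_, tag) in enumerate(tokens) if tag == "DT"]

# Corresponds to: MU(meet "ground" not [pos = "DT"] -3 -1);
result = meet_not(ground, articles, -3, -1)
print(result)  # [5]
```

The first instance of ground is discarded because an article occurs directly before it; the second instance has no article among the three preceding tokens and is kept.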
MU queries are less flexible than standard CQP queries, because they lack the capacity for token-level regular expressions. But MU queries can be much more efficient for determining co-occurrences at relatively large distances, and for finding sequences that consist of one or more very frequent elements followed by a rare item. For example,
> MU(meet (meet [pos="NN.*"] "virtue" 2 2) "of" 1 1);
is considerably faster than
> [pos = "NN.*"] "of" "virtue";
The following query finds sentences that contain both one hand and other hand. The MU query returns only the position of one, which is then expanded to the complete sentence:
> MU(meet (meet "one" "hand" 1 1) (meet "other" "hand" 1 1) s) expand to s;
Combinations of meet and union clauses offer additional flexibility. The following query finds nouns occurring close to a superlative adjective, which can either be synthetic (strangest) or analytic (most extravagant).
> MU(meet [pos="NNS?"] (union [pos="JJS"] (meet [pos="JJ"] "most" -1 -1)) -2 4);
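To see how the meet and union clauses of this query interact, it can help to trace them on a toy example. The Python sketch below is a simplified model under stated assumptions: the helpers meet, union and find are hypothetical, the tagged phrase is invented, and patterns are matched against the whole attribute value with re.fullmatch.

```python
import re

# Toy models of meet and union clauses (not CQP's implementation).
def meet(first, second, left, right):
    second = set(second)
    return sorted(p for p in set(first)
                  if any(p + d in second for d in range(left, right + 1)))

def union(first, second):
    return sorted(set(first) | set(second))

def find(tokens, field, pattern):
    """Positions whose attribute value (field 0 = word, 1 = pos) matches
    the regular expression as a whole."""
    return [i for i, tok in enumerate(tokens) if re.fullmatch(pattern, tok[field])]

# Hypothetical tagged tokens (word, pos)
tokens = [("the", "DT"), ("strangest", "JJS"), ("idea", "NN"), ("and", "CC"),
          ("the", "DT"), ("most", "RBS"), ("extravagant", "JJ"), ("plan", "NN")]

analytic = meet(find(tokens, 1, "JJ"), find(tokens, 0, "most"), -1, -1)
superlatives = union(find(tokens, 1, "JJS"), analytic)
result = meet(find(tokens, 1, "NNS?"), superlatives, -2, 4)

print(analytic)      # [6]
print(superlatives)  # [1, 6]
print(result)        # [2, 7]
```

The inner meet keeps extravagant because most immediately precedes it, the union adds the synthetic superlative strangest, and the outer meet returns both nouns, since each has one of these adjectives inside its window.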
Like standard queries, MU queries can be used as subquery filters (followed by !) or combined with a cut and/or expand clause. However, other elements of standard queries are not supported: labels, target markers (@), zero-width assertions (obviously), global constraints (after ::), alignment constraints and within clauses.