- CQP offers search-engine-like “Boolean” queries in a special meet-union
(MU) notation. This feature goes back to the original developer of CWB,
but was not supported officially before CWB v3.4.12. In particular,
there was no precise specification of the semantics of MU queries,
and the original implementation did not produce consistent results.
- new in v3.4.12: Recently, MU queries have found more widespread use as
proximity queries in the CEQL “simple query” syntax of BNCweb and CQPweb,
giving them a semi-official status. For this reason, the implementation was modified
to ensure a consistent and well-defined behaviour, although it may not always
correspond to what is desired intuitively. The new MU implementation is documented here.
- Warning: both the syntax and the semantics of MU queries are subject to
fundamental revisions in the next major release of CWB (version 4.0),
but are considered stable with long-term support for CWB 3.5.
- A meet-union query consists of nested meet and union operations
forming a binary-branching tree that is written in LISP-like prefix notation.
MU queries always start with the keyword MU
and are completely separate from the standard CQP syntax, sharing only the system
by which individual token patterns are specified.
- The simplest form of a MU query specifies a single token pattern,
which may also be given in shorthand notation if the default p-attribute is to be matched.
These queries are fully equivalent to the corresponding standard queries
(which would be the same, but without the leading MU).
> MU [lemma = "light" & pos = "V.*"];
> MU "lights" %c;
- A meet clause matches two token patterns within a specified
distance of each other. More precisely, instances of the first pattern are filtered,
keeping only those where the second pattern occurs within the specified window.
For example, the following query finds nouns that co-occur with the adjective lovely:
> MU(meet [pos = "NN.*"] [lemma = "lovely"] -2 2);
This query returns all nouns for which lovely occurs within two tokens to the left
(window starting at offset -2) or right (window ending at offset +2).
The adjective lovely is not included in the match, nor marked in any other way.
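The filtering behaviour of a windowed meet clause can be modelled in a few lines of Python. This is only a sketch of the semantics described above, not CQP's implementation; the function name `meet`, the toy corpus and its part-of-speech tags are assumptions made for illustration.

```python
def meet(first, second, lo, hi):
    """Keep each position in `first` for which some position in `second`
    lies at an offset between lo and hi (inclusive)."""
    second = set(second)
    return [p for p in first
            if any(p + off in second for off in range(lo, hi + 1))]

# Hypothetical tagged tokens: (word, pos)
corpus = [
    ("a", "DT"), ("lovely", "JJ"), ("old", "JJ"), ("garden", "NN"),
    ("with", "IN"), ("a", "DT"), ("small", "JJ"), ("pond", "NN"),
]
nouns = [i for i, (_, pos) in enumerate(corpus) if pos.startswith("NN")]
lovely = [i for i, (word, _) in enumerate(corpus) if word == "lovely"]

# Counterpart of: MU(meet [pos = "NN.*"] [lemma = "lovely"] -2 2);
hits = meet(nouns, lovely, -2, 2)
print([corpus[i][0] for i in hits])  # only noun positions are returned
```

In this toy corpus only garden is kept, because lovely sits two tokens to its left; pond has no lovely in its window and is filtered out. The result contains the noun position alone, mirroring the fact that the second pattern is not part of the match.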
- In order to match only prenominal adjectives, we can change the window to include only
the three tokens preceding the noun (i.e. offsets -3 ... -1):
> MU(meet [pos = "NN.*"] [lemma = "lovely"] -3 -1);
- Since a meet clause returns only occurrences of the first token pattern,
we need to change the ordering in order to focus on the adjective rather than the nouns.
Don't forget to adjust the window offsets accordingly!
> MU(meet [lemma = "lovely"] [pos = "NN.*"] 1 3);
Note that meet operations are not symmetric: this query returns fewer matches
than the previous one, because an instance of lovely is returned only once
even when several nouns occur within its window.
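The asymmetry is easy to reproduce with a small Python model of the meet operation. The helper `meet`, the example phrase and the assumed tagging are hypothetical and serve only to illustrate the counting difference.

```python
def meet(first, second, lo, hi):
    # Keep positions of `first` that have a `second` position in [lo, hi].
    second = set(second)
    return [p for p in first if any(p + d in second for d in range(lo, hi + 1))]

tokens = ["such", "lovely", "tea", "cakes", "today"]
lovely = [1]     # positions matching [lemma = "lovely"]
nouns = [2, 3]   # positions matching [pos = "NN.*"] (assumed tagging)

noun_focus = meet(nouns, lovely, -3, -1)  # nouns preceded by lovely
adj_focus = meet(lovely, nouns, 1, 3)     # lovely followed by a noun
print(len(noun_focus), len(adj_focus))
```

Both nouns qualify in the noun-focused query, whereas the adjective-focused query reports the single instance of lovely once.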
- Alternatively, we can search for co-occurrence within sentences or other s-attribute
regions. Again, the ordering of the token constraints determines whether we focus on
tea or cakes:
> MU(meet "tea"%c "cakes"%c s);
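Co-occurrence within an s-attribute region can be sketched in the same spirit. The Python model below assumes that sentence regions tile the corpus without gaps and are given by their start positions; `region_index`, `meet_within` and the sample text are illustrative names and data, not part of CQP.

```python
from bisect import bisect_right

def region_index(starts, pos):
    """Index of the region containing pos, given sorted region start positions."""
    return bisect_right(starts, pos) - 1

def meet_within(first, second, starts):
    """Keep positions of `first` whose region also contains a `second` position."""
    regions_with_second = {region_index(starts, p) for p in second}
    return [p for p in first if region_index(starts, p) in regions_with_second]

tokens = "we had tea and cakes . then more tea . cakes came later .".split()
sentence_starts = [0, 6, 10]  # assumed start positions of the three sentences

tea = [i for i, w in enumerate(tokens) if w == "tea"]
cakes = [i for i, w in enumerate(tokens) if w == "cakes"]

tea_focus = meet_within(tea, cakes, sentence_starts)    # tea first: tea positions
cakes_focus = meet_within(cakes, tea, sentence_starts)  # cakes first: cakes positions
print(tea_focus, cakes_focus)
```

Only the first sentence contains both words, so each ordering yields one position, but a different one: the position of tea or the position of cakes.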
- A union clause simply combines the matches of two token patterns
into a set union, corresponding to a disjunction (logical or) of the constraints.
The following three queries are fully equivalent:
> MU(union "tea"%c "coffee"%c);
> "tea"%c | "coffee"%c;
> [(word = "tea" %c) | (word = "coffee" %c)];
- MU queries are relatively powerful because the two elements of a meet
or union clause can themselves be complex clauses.
For example, the trigram in due course can be found by nesting two
meet conditions:
> MU(meet (meet "in" "due" 1 1) "course" 2 2);
The inner clause returns all instances of in that are immediately followed by
due; the outer clause requires that the following token
(the token at an offset of +2 from in) must be course.
We can obtain exactly the same result with this query:
> MU(meet "in" (meet "due" "course" 1 1) 1 1);
Now the inner clause determines all occurrences of the bigram due course,
but returns only the corpus positions of due,
which must appear immediately after in.
Can you find two other MU formulations that produce exactly the same results?
- Keep in mind that the final result includes only the corpus positions of
the leftmost-specified token pattern. If you want to find instances of
course in this multiword expression, rewrite the query as
> MU(meet (meet "course" "due" -1 -1) "in" -2 -2);
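To see how nesting behaves, the three formulations given above can be replayed on a toy token sequence. This is a minimal Python sketch under the assumption that a meet clause simply filters the positions of its first argument; `meet`, `pos` and the sample sentence are hypothetical.

```python
def meet(first, second, lo, hi):
    # Keep positions of `first` that have a `second` position in [lo, hi].
    second = set(second)
    return [p for p in first if any(p + d in second for d in range(lo, hi + 1))]

tokens = "in due course we met in due time of course".split()

def pos(word):
    """Corpus positions of a word form."""
    return [i for i, w in enumerate(tokens) if w == word]

# MU(meet (meet "in" "due" 1 1) "course" 2 2);
a = meet(meet(pos("in"), pos("due"), 1, 1), pos("course"), 2, 2)
# MU(meet "in" (meet "due" "course" 1 1) 1 1);
b = meet(pos("in"), meet(pos("due"), pos("course"), 1, 1), 1, 1)
# MU(meet (meet "course" "due" -1 -1) "in" -2 -2);
c = meet(meet(pos("course"), pos("due"), -1, -1), pos("in"), -2, -2)
print(a, b, c)
```

The first two variants agree on the position of in, while the third reports the position of course for the same trigram, since the result always consists of positions of the leftmost-specified pattern.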
- new in v3.4.30: meet clauses in MU queries can now also be
negated. When the not operator is given, CQP discards
(rather than exclusively retaining) all matches for the first subclause that
co-occur with a match of the second subclause. The following query finds all
instances of ground that are not preceded by an article
within a span of 3 tokens:
> MU(meet "ground" not [pos = "DT"] -3 -1);
Such negated meet clauses work for all context specifications
and can be used in arbitrarily nested MU queries,
allowing for even more complex co-occurrence filters.
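Negation can be added to the toy model by inverting the filter condition. The sketch below is an assumption about the semantics as described in the text, not a rendering of CQP's code; the `negate` parameter and the tagged sample are invented for illustration.

```python
def meet(first, second, lo, hi, negate=False):
    """Keep positions of `first` that co-occur with `second` in [lo, hi];
    with negate=True, keep those that do NOT co-occur."""
    second = set(second)

    def cooccurs(p):
        return any(p + d in second for d in range(lo, hi + 1))

    return [p for p in first if cooccurs(p) != negate]

# Hypothetical tagged tokens: (word, pos)
corpus = [
    ("the", "DT"), ("solid", "JJ"), ("ground", "NN"),
    ("on", "IN"), ("firm", "JJ"), ("ground", "NN"),
]
ground = [i for i, (w, _) in enumerate(corpus) if w == "ground"]
articles = [i for i, (_, t) in enumerate(corpus) if t == "DT"]

# Counterpart of: MU(meet "ground" not [pos = "DT"] -3 -1);
without_article = meet(ground, articles, -3, -1, negate=True)
with_article = meet(ground, articles, -3, -1)
print(without_article, with_article)
```

The plain and the negated clause partition the matches of the first pattern: every instance of ground lands in exactly one of the two result lists.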
- MU queries are less flexible than standard CQP queries, because they lack
the capacity for token-level regular expressions. But MU queries can be much
more efficient for determining co-occurrences at relatively large distances,
and for finding sequences that consist of one or more very frequent elements
followed by a rare item. For example,
> MU(meet (meet [pos="NN.*"] "virtue" 2 2) "of" 1 1);
is considerably faster than
> [pos = "NN.*"] "of" "virtue";
- The following query finds sentences that contain both one hand and
other hand. The MU query returns only the position of one,
which is then expanded to the complete sentence:
> MU(meet (meet "one" "hand" 1 1) (meet "other" "hand" 1 1) s) expand to s;
- Combinations of meet and union clauses offer additional
flexibility. The following query finds nouns occurring close to a superlative
adjective, which can either be synthetic (strangest) or analytic
(most extravagant).
> MU(meet [pos="NNS?"] (union [pos="JJS"] (meet [pos="JJ"] "most" -1 -1)) -2 4);
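The interplay of meet and union in the superlative query can be traced step by step with a small Python model. The helpers `meet` and `union`, the sample phrase and its tags are assumptions made for this sketch; `re.fullmatch` is used because CQP regular expressions must match the whole attribute value.

```python
import re

def meet(first, second, lo, hi):
    # Keep positions of `first` that have a `second` position in [lo, hi].
    second = set(second)
    return [p for p in first if any(p + d in second for d in range(lo, hi + 1))]

def union(a, b):
    # Set union of two position lists, in corpus order.
    return sorted(set(a) | set(b))

# Hypothetical tagged tokens: (word, pos)
corpus = [
    ("the", "DT"), ("strangest", "JJS"), ("thing", "NN"), ("about", "IN"),
    ("the", "DT"), ("most", "RBS"), ("extravagant", "JJ"), ("hats", "NNS"),
    ("and", "CC"), ("plain", "JJ"), ("coats", "NNS"),
]

def by_pos(pattern):
    return [i for i, (_, t) in enumerate(corpus) if re.fullmatch(pattern, t)]

def by_word(word):
    return [i for i, (w, _) in enumerate(corpus) if w == word]

analytic = meet(by_pos("JJ"), by_word("most"), -1, -1)  # adjective preceded by most
superlatives = union(by_pos("JJS"), analytic)
nouns = meet(by_pos("NNS?"), superlatives, -2, 4)
print([corpus[i][0] for i in nouns])
```

Here thing is kept because of the synthetic superlative strangest and hats because of most extravagant, whereas coats is dropped: plain is not preceded by most and therefore never enters the union.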
- Like standard queries, MU queries can be used as subquery filters
(followed by !) or combined with a cut and/or
expand clause. However, other elements of standard queries
are not supported: labels, target markers (@), zero-width
assertions (obviously), global constraints (after ::),
alignment constraints and within clauses.