8.3 CQP built-in functions
The CQP query language offers a number of built-in functions that can be
applied to attribute values within query constraints (but not anywhere else,
e.g. in group or tabulate commands).
The list below shows all built-in functions that are currently
available.
- f(att):
- frequency of the current value of p-attribute
att (cannot be used with s-attributes or literal values); e.g.
[word = ".*able" & f(word) < 10]
- dist(a, b), distance(a, b):
- signed distance between the two tokens referenced by labels a and b,
computed as the difference of their corpus positions;
explicit numeric corpus positions may be specified instead of labels; e.g.
... :: dist(matchend, match) >= 10;
- distabs(a, b):
- unsigned distance between two tokens;
e.g.
[distabs(_, 1000) <= 10]
as an inefficient way to match the 10 tokens
to the left and right of corpus position 1000
- int(str):
- cast str to a signed integer number
so numeric comparisons can be made; raises an error if str
is not a numeric string; e.g.
... :: int(match.text_year) <= 1900;
- lbound(att), rbound(att):
- evaluates to true
if the current corpus position is the first or last token in a region of
s-attribute att, respectively
- lbound_of(att, a), rbound_of(att, a):
- returns the corpus position of the start or end, respectively, of the region of s-attribute
att containing the token referenced by label a, suitable for use
with dist(); if a is not within a region of att, an undefined value is returned,
which evaluates to false in most contexts [new in v3.4.13]
- unify(fs1, fs2):
- compute the intersection of two sorted
feature sets specified as strings fs1 and fs2, corresponding
to a unification of feature bundles; if the first argument is an undefined value,
fs2 is returned; see Sec. 6.6 for details
- ambiguity(fs):
- compute the size of a feature set specified as string
fs, i.e. the number of elements; if fs is an undefined value, a size
of 0 is returned (same as for the empty feature set |);
see Sec. 6.6 for details
- add(x, y), sub(x, y), mul(x, y):
- simple arithmetic on integer values x and y, which can also be
corpus positions specified as labels; corpus annotations are strings and
must be typecast with int() before they can be used in computations
- prefix(str1, str2):
- returns the longest common prefix of the strings
str1 and str2; warning: this function operates on bytes
and may return an incomplete UTF-8 character
- is_prefix(str1, str2):
- returns true if string
str1 is a prefix of str2; e.g.
[is_prefix(lemma, word)]
- minus(str1, str2):
- removes the longest common prefix
of str1 and str2 from the string str1 and returns
the remaining suffix; warning: this function operates on bytes and may return
an incomplete UTF-8 character
- ignore(a):
- ignore the label a and always return true;
for internal use by the /undef[] macro, see
Sec. 8.2 for details
- normalize(str, flags):
- apply case-folding and/or diacritic folding
to the string str and return the normalized value;
flags must be a literal string "c", "d" or "cd"
(optionally preceded by %, e.g. "%cd");
e.g. [normalize(word, "cd") != normalize(lemma, "cd")]
to find non-trivial differences between word form and lemma [new in v3.4.11]
- strlen(str):
- returns the length of str in characters
(if the active corpus is encoded in UTF-8) or bytes (for all other encodings)
[new in v3.4.17]
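
The feature-set functions and the byte-wise behaviour of prefix() may be easier to follow with a small model outside CQP. The Python sketch below is only an illustration of the semantics described in the list above, not the CWB implementation. It assumes the |-delimited feature set notation of Sec. 6.6 (e.g. |Acc|Nom|, with | as the empty set), uses None as a stand-in for CQP's undefined value, and assumes a UTF-8 encoded corpus. The helper parse_fs is hypothetical and exists only for this example.

```python
from typing import List, Optional


def parse_fs(fs: str) -> List[str]:
    """Hypothetical helper: split a feature set string such as '|Acc|Nom|'."""
    return [item for item in fs.split("|") if item]


def unify(fs1: Optional[str], fs2: str) -> str:
    """Model of unify(): intersection of two feature sets.

    None stands in for an undefined first argument, in which case
    the second argument is returned unchanged.
    """
    if fs1 is None:
        return fs2
    common = sorted(set(parse_fs(fs1)) & set(parse_fs(fs2)))
    # Re-serialise in sorted order; the empty set is written as '|'.
    return "|" + "".join(item + "|" for item in common)


def ambiguity(fs: Optional[str]) -> int:
    """Model of ambiguity(): number of elements, 0 for an undefined value."""
    if fs is None:
        return 0
    return len(parse_fs(fs))


def prefix(str1: str, str2: str) -> bytes:
    """Model of prefix(): longest common prefix, compared byte by byte."""
    b1, b2 = str1.encode("utf-8"), str2.encode("utf-8")
    n = 0
    while n < min(len(b1), len(b2)) and b1[n] == b2[n]:
        n += 1
    return b1[:n]


print(unify("|Acc|Nom|", "|Dat|Nom|"))  # |Nom|
print(unify("|Acc|", "|Dat|"))          # |  (empty feature set)
print(ambiguity("|Acc|Nom|"))           # 2
print(prefix("März", "Möbel"))          # b'M\xc3'
```

The last call shows why the warning matters: in UTF-8, ä and ö share their first byte, so the common prefix ends in the middle of a character and the result is not a valid UTF-8 string.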