8.3 CQP built-in functions

The CQP query language offers a number of built-in functions that can be applied to attribute values within query constraints (but not anywhere else, e.g. in group or tabulate commands). The list below shows all built-in functions that are currently available.

f(att):
frequency of the current value of p-attribute att (cannot be used with s-attributes or literal values); e.g. [word = ".*able" & f(word) < 10]

dist(a, b), distance(a, b):
signed distance between two tokens referenced by labels a and b; explicit numeric corpus positions may be specified instead of labels; computes the difference $b - a$; e.g. ... :: dist(match, matchend) >= 10;

distabs(a, b):
unsigned distance between two tokens; e.g. [distabs(_, 1000) <= 10] as an inefficient way to match 10 tokens to the left and right of corpus position 1000
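The sign convention of dist() and the symmetric window selected by distabs() can be checked with a small Python model. This is only a sketch of the arithmetic described above, not CQP code: the helper names mirror the built-ins, and the corpus positions are made-up values chosen for illustration.

```python
def dist(a: int, b: int) -> int:
    """Signed distance between two corpus positions, computed as b - a."""
    return b - a


def distabs(a: int, b: int) -> int:
    """Unsigned distance between two corpus positions."""
    return abs(b - a)


# Hypothetical match spanning corpus positions 40..52.
match, matchend = 40, 52
forward = dist(match, matchend)    # positive: matchend lies to the right
backward = dist(matchend, match)   # negative: same tokens, swapped order

# Corpus positions that would satisfy distabs(_, 1000) <= 10.
window = [cpos for cpos in range(980, 1021) if distabs(cpos, 1000) <= 10]
```

Because the result is signed, the order of the two arguments matters for dist() but not for distabs().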

int(str):
cast str to a signed integer number so numeric comparisons can be made; raises an error if str is not a numeric string; e.g. ... :: int(match.text_year) <= 1900;

lbound(att), rbound(att):
evaluates to true if the current corpus position is the first or last token in a region of s-attribute att, respectively

lbound_of(att, a), rbound_of(att, a):
returns the corpus position of the start or end of the region of s-attribute att containing the token referenced by label a, suitable for use with dist(); if a is not within a region of att, an undefined value is returned, which evaluates to false in most contexts [new in v3.4.13]
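How lbound_of() and rbound_of() combine with dist() can be illustrated with a minimal Python model. It assumes that an s-attribute is a list of non-overlapping regions given as inclusive (start, end) corpus positions, and it uses None to stand in for the undefined value; the region data and helper names are hypothetical.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # inclusive (start, end) corpus positions


def find_region(regions: List[Region], cpos: int) -> Optional[Region]:
    """Return the region containing cpos, or None if there is none."""
    for start, end in regions:
        if start <= cpos <= end:
            return (start, end)
    return None


def lbound_of(regions: List[Region], cpos: int) -> Optional[int]:
    """Start of the region containing cpos (None models 'undefined')."""
    region = find_region(regions, cpos)
    return None if region is None else region[0]


def rbound_of(regions: List[Region], cpos: int) -> Optional[int]:
    """End of the region containing cpos (None models 'undefined')."""
    region = find_region(regions, cpos)
    return None if region is None else region[1]


# Hypothetical sentence regions; positions 12..14 lie outside any region.
sentences = [(0, 4), (5, 11), (15, 20)]

token = 7
offset_from_start = token - lbound_of(sentences, token)   # like dist(lbound_of(s, a), a)
tokens_to_end = rbound_of(sentences, token) - token       # like dist(a, rbound_of(s, a))
outside = lbound_of(sentences, 13)                        # no containing region
```

In this model the token at position 7 is the third token of its sentence (offset 2) and has four tokens after it before the sentence ends.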

unify(fs$_1$, fs$_2$):
compute the intersection of two sorted feature sets specified as strings fs$_1$ and fs$_2$, corresponding to a unification of feature bundles; if the first argument is an undefined value, fs$_2$ is returned; see Sec. 6.6 for details

ambiguity(fs):
compute the size of a feature set specified as string fs, i.e. the number of elements; if fs is an undefined value, a size of 0 is returned (same as for the empty feature set |); see Sec. 6.6 for details
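The behaviour of unify() and ambiguity() can be sketched in Python. The model below assumes that feature sets are written in the |-delimited notation (e.g. |a|b|, with a bare | for the empty set) and that None stands in for the undefined value; the example strings are invented, and Python's default string ordering is used as a stand-in for whatever sort order the corpus actually uses.

```python
from typing import List, Optional


def parse_fs(fs: str) -> List[str]:
    """Split a feature-set string such as '|a|b|' into its elements."""
    return [item for item in fs.strip("|").split("|") if item]


def format_fs(items: List[str]) -> str:
    """Serialise elements as a sorted feature-set string; [] gives '|'."""
    return "|" + "".join(item + "|" for item in sorted(items))


def unify(fs1: Optional[str], fs2: str) -> str:
    """Intersection of two feature sets; undefined fs1 yields fs2."""
    if fs1 is None:
        return fs2
    common = set(parse_fs(fs1)) & set(parse_fs(fs2))
    return format_fs(list(common))


def ambiguity(fs: Optional[str]) -> int:
    """Number of elements in a feature set; 0 if undefined or empty."""
    if fs is None:
        return 0
    return len(parse_fs(fs))


# Hypothetical agreement features of a determiner and a noun.
det = "|Akk:F:Sg|Nom:F:Sg|"
noun = "|Dat:F:Sg|Nom:F:Sg|"
agreed = unify(det, noun)
```

An empty intersection produces the empty set |, whose ambiguity is 0, so a constraint such as "ambiguity of the unified set is at least 1" expresses that the two feature bundles are compatible.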

add(x, y), sub(x, y), mul(x, y):
simple arithmetic on integer values x and y, which can also be corpus positions specified as labels; when performing computations on corpus annotations, they have to be typecast with int() first

prefix(str$_1$, str$_2$):
returns the longest common prefix of strings str$_1$ and str$_2$; warning: this function operates on bytes and may return an incomplete UTF-8 character

is_prefix(str$_1$, str$_2$):
returns true if string str$_1$ is a prefix of str$_2$; e.g. [is_prefix(lemma, word)]

minus(str$_1$, str$_2$):
removes the longest common prefix of str$_1$ and str$_2$ from the string str$_1$ and returns the remaining suffix; warning: this function operates on bytes and may return an incomplete UTF-8 character
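The byte-level warning for prefix() and minus() is easiest to see in a concrete case. The following Python sketch models the three string functions on UTF-8 byte strings; it illustrates the behaviour described above rather than the actual implementation, and the example words are chosen freely.

```python
def prefix(s1: bytes, s2: bytes) -> bytes:
    """Longest common prefix of two byte strings."""
    n = 0
    for x, y in zip(s1, s2):
        if x != y:
            break
        n += 1
    return s1[:n]


def is_prefix(s1: bytes, s2: bytes) -> bool:
    """True if s1 is a prefix of s2."""
    return s2.startswith(s1)


def minus(s1: bytes, s2: bytes) -> bytes:
    """s1 with the longest common prefix of s1 and s2 removed."""
    return s1[len(prefix(s1, s2)):]


# Plain ASCII: the results are ordinary strings.
stem = prefix(b"gehen", b"gehst")
ending = minus(b"gehen", b"gehst")

# UTF-8: 'ä' is C3 A4 and 'ö' is C3 B6, so they share their first byte.
a = "Bär".encode("utf-8")
o = "Bör".encode("utf-8")
broken_prefix = prefix(a, o)   # ends in the lone lead byte C3
broken_rest = minus(a, o)      # starts with the lone continuation byte A4

try:
    broken_prefix.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```

Two different non-ASCII characters can share their leading byte, in which case the common prefix ends in the middle of a character and neither the prefix nor the remainder is valid UTF-8.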

ignore(a):
ignore the label a and always return true; for internal use by the /undef[] macro, see Sec. 8.2 for details

normalize(str, flags):
apply case-folding and/or diacritic folding to the string str and return the normalized value; flags must be a literal string "c", "d" or "cd" (with an optional %, e.g. "%cd"); e.g. [normalize(word, "cd") != normalize(lemma, "cd")] to find non-trivial differences between word form and lemma [new in v3.4.11]

strlen(str):
returns the length of str in characters (if the active corpus is encoded in UTF-8) or bytes (for all other encodings) [new in v3.4.17]
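The difference between the two counting modes of strlen() can be modelled in Python as follows. This sketch assumes that "characters" means Unicode code points and takes the corpus encoding as an explicit argument; the function signature and the charset names are illustrative only and do not correspond to anything in CQP.

```python
def strlen(value: bytes, corpus_charset: str) -> int:
    """Length in characters for UTF-8 corpora, in bytes otherwise."""
    if corpus_charset.lower() in ("utf8", "utf-8"):
        # Count code points by decoding the byte string first.
        return len(value.decode("utf-8"))
    # For any other encoding, simply count bytes.
    return len(value)


word = "Käse"
utf8_bytes = word.encode("utf-8")      # 'ä' occupies two bytes
latin1_bytes = word.encode("latin-1")  # 'ä' occupies one byte

len_utf8 = strlen(utf8_bytes, "utf8")
len_latin1 = strlen(latin1_bytes, "latin1")
raw_utf8_size = len(utf8_bytes)
```

For single-byte encodings the byte count and the character count coincide, so both corpora report the same length for this word even though the UTF-8 form is one byte longer.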