8.3 CQP built-in functions

The CQP query language offers a number of built-in functions that can be applied to attribute values within query constraints (but not anywhere else, e.g. in group or tabulate commands). The list below shows all built-in functions that are currently available.

f(att):: frequency of the current value of p-attribute att (cannot be used with s-attributes or literal values); e.g. [word = ".*able" & f(word) < 10]
dist(a, b), distance(a, b):: signed distance between two tokens referenced by labels a and b; explicit numeric corpus positions may be specified instead of labels; computes the difference of the two corpus positions; e.g. ... :: dist(matchend, match) >= 10;
distabs(a, b):: unsigned distance between two tokens; e.g. [distabs(_, 1000) <= 10] as an inefficient way to match the 10 tokens to the left and right of corpus position 1000
int(str):: cast str to a signed integer number so numeric comparisons can be made; raises an error if str is not a numeric string; e.g. ... :: int(match.text_year) <= 1900;
lbound(att), rbound(att):: evaluates to true if the current corpus position is the first or last token in a region of s-attribute att, respectively
lbound_of(att, a), rbound_of(att, a):: returns the corpus position of the start or end of the region of s-attribute att containing the token referenced by label a, suitable for use with dist();²⁰ if a is not within a region of att, an undefined value is returned, which evaluates to false in most contexts [new in v3.4.13]
unify(fs1, fs2):: compute the intersection of two sorted feature sets specified as strings fs1 and fs2, corresponding to a unification of feature bundles; if the first argument is an undefined value, fs2 is returned; see Sec. 6.6 for details
ambiguity(fs):: compute the size of a feature set specified as string fs, i.e. the number of elements; if fs is an undefined value, a size of 0 is returned (same as for |); see Sec. 6.6 for details
add(x, y), sub(x, y), mul(x, y):: simple arithmetic on integer values x and y, which can also be corpus positions specified as labels; when performing computations on corpus annotations, they have to be typecast with int() first
prefix(str1, str2):: returns the longest common prefix of strings str1 and str2; warning: this function operates on bytes and may return an incomplete UTF-8 character
is_prefix(str1, str2):: returns true if string str1 is a prefix of str2; e.g. [is_prefix(lemma, word)]
minus(str1, str2):: removes the longest common prefix of str1 and str2 from the string str1 and returns the remaining suffix; warning: this function operates on bytes and may return an incomplete UTF-8 character
ignore(a):: ignore the label a and always return true; for internal use by the /undef[] macro, see Sec. 8.2 for details
normalize(str, flags):: apply case-folding and/or diacritic folding to the string str and return the normalized value; flags must be a literal string "c", "d" or "cd" (with an optional %, e.g. "%cd"); e.g. [normalize(word, "cd") != normalize(lemma, "cd")] to find non-trivial differences between word form and lemma [new in v3.4.11]
strlen(str):: returns the length of str in characters (if the active corpus is encoded in UTF-8) or bytes (for all other encodings) [new in v3.4.17]
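
The byte-level warning attached to prefix() and minus() is easiest to see in a small model. The Python sketch below is not the CQP implementation; it only mimics the behaviour described in the list above, under the assumptions that strings are compared as UTF-8 bytes, that is_prefix() tests whether its first argument is a prefix of its second, and that minus() strips the common prefix from its first argument. The helper names common_prefix and is_valid_utf8 are hypothetical and exist only for this illustration.

```python
# Illustrative model only: NOT the CQP source code.
# Assumption: strings are compared byte by byte in their UTF-8 encoding.

def common_prefix(a: bytes, b: bytes) -> bytes:
    """Longest common prefix of two byte strings (hypothetical helper)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def prefix(str1: str, str2: str) -> bytes:
    """Model of prefix(): longest common prefix, computed on bytes."""
    return common_prefix(str1.encode("utf-8"), str2.encode("utf-8"))

def is_prefix(str1: str, str2: str) -> bool:
    """Model of is_prefix(): assumed to test whether str1 is a prefix of str2."""
    return str2.encode("utf-8").startswith(str1.encode("utf-8"))

def minus(str1: str, str2: str) -> bytes:
    """Model of minus(): assumed to strip the common prefix from str1."""
    b1 = str1.encode("utf-8")
    return b1[len(common_prefix(b1, str2.encode("utf-8"))):]

def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

if __name__ == "__main__":
    # ASCII-only strings behave as expected.
    print(prefix("walking", "walked"))        # common bytes of both strings
    print(minus("walking", "walk"))           # remaining suffix of the first
    print(is_prefix("walk", "walked"))

    # U+00E9 and U+00E8 share their first UTF-8 byte (0xC3), so a
    # byte-wise prefix cuts the character in half.
    p = prefix("caf\u00e9", "caf\u00e8")
    print(p, is_valid_utf8(p))
```

In this model, two accented characters that share a leading UTF-8 byte yield a prefix ending in a dangling lead byte and a suffix starting with an orphaned continuation byte. That is the kind of incomplete character the warning refers to, so results of prefix() and minus() on UTF-8 corpora are best treated with care when the strings differ in a non-ASCII character.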