A.1 Summary of regular expression syntax
At the character level, CQP supports regular expressions using one of two
regex libraries:
CWB 3.0: Uses POSIX 1003.2 regular expressions (as
provided by the system libraries). A full description of the regular
expression syntax can be found on the regex(7) manpage.
CWB 3.5: Uses PCRE (Perl Compatible Regular Expressions).
A full description of the regular expression syntax can be found
on the pcrepattern(3) manpage; see also http://www.pcre.org/.
Various books such as Mastering Regular Expressions give a gentle introduction
to writing regular expressions and provide a lot of additional information.
There are also many tutorials to be found online
using Your Favourite Web Search Engine™.
- A regular expression is a concise description of a set of character
strings (which are called words in formal language theory). Note
that only certain sets of words with a relatively simple structure can be
represented in such a way. Regular expressions are said to match the
words they describe. The following examples use the notation:
<reg.exp.> → word, word, ...
to indicate that the regular expression before the arrow matches
the word or words after the arrow. In many programming languages,
it is customary to enclose regular expressions in forward slashes (/).
CQP uses a different syntax: regular expressions are written
as (single- or double-quoted) strings.
The examples below omit any delimiters.
- Basic syntax of regular expressions
- letters and digits are matched literally (including all non-ASCII characters)
word → word; C3PO → C3PO; déjà → déjà
- . matches any single character (“matchall”)
r.ng → ring, rung, rang, rkng, r3ng, ...
- character set: [...] matches any of the characters listed
moderni[sz]e → modernise, modernize
[a-c5-9] → a, b, c, 5, 6, 7, 8, 9
[^aeiou] → b, c, d, f, ..., 1, 2, 3, ..., ä, à, á, ...
- repetition of the preceding element (character or group):
? (0 or 1), * (0 or more), + (1 or more), {n} (exactly n), {n,m} (n to m)
colou?r → color, colour
go{2,4}d → good, goood, gooood
[A-Z][a-z]+ → “regular” capitalised word such as British
- grouping with parentheses: (...)
(bla)+ → bla, blabla, blablabla, ...
(school)?bus(es)? → bus, buses, schoolbus, schoolbuses
- | separates alternatives (use parentheses to limit scope)
mouse|mice → mouse, mice
corp(us|ora) → corpus, corpora
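The basic syntax elements above can be tried out in a few lines of Python. This is only a sketch: Python's re module stands in for the POSIX or PCRE library that CQP actually uses, the helper name matches is hypothetical, and re.fullmatch is used on the assumption that a pattern has to cover the whole word, as in the examples above.

```python
import re

# Pattern -> words it is expected to match in full (taken from the examples above).
EXAMPLES = {
    r"r.ng": ["ring", "rung", "rang", "rkng", "r3ng"],
    r"moderni[sz]e": ["modernise", "modernize"],
    r"[a-c5-9]": ["a", "b", "c", "5", "9"],
    r"colou?r": ["color", "colour"],
    r"go{2,4}d": ["good", "goood", "gooood"],
    r"(bla)+": ["bla", "blabla", "blablabla"],
    r"(school)?bus(es)?": ["bus", "buses", "schoolbus", "schoolbuses"],
    r"corp(us|ora)": ["corpus", "corpora"],
}


def matches(pattern, word):
    """Hypothetical helper: True if the pattern matches the whole word."""
    return re.fullmatch(pattern, word) is not None


for pattern, words in EXAMPLES.items():
    for word in words:
        print(pattern, "->", word, matches(pattern, word))

# A few words that fall outside the described sets.
print(matches(r"go{2,4}d", "god"))   # only one "o", so no match
print(matches(r"r.ng", "rng"))       # "." needs exactly one character
```

Basic features such as these behave the same way in most regex libraries, but more advanced constructs differ between POSIX, PCRE and Python, so the manpages cited above remain the reference for CQP.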
- Complex regular expressions can be used to model (regular) inflection:
- ask(s|ed|ing)? → ask, asks, asked, asking
(equivalent to the less compact expression ask|asks|asked|asking)
- sa(y(s|ing)?|id) → say, says, saying, said
- [a-z]+i[sz](e[sd]?|ing) → any form of a verb with -ise or -ize suffix
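The inflection patterns above can be checked against a small word list. The following is a minimal sketch in Python, again using the re module as a stand-in for CQP's regex library and assuming whole-word matching; the helper forms and the candidate word lists are made up for illustration.

```python
import re

ASK = re.compile(r"ask(s|ed|ing)?")
ASK_LONG = re.compile(r"ask|asks|asked|asking")   # the less compact equivalent
SAY = re.compile(r"sa(y(s|ing)?|id)")
ISE = re.compile(r"[a-z]+i[sz](e[sd]?|ing)")


def forms(regex, candidates):
    """Hypothetical helper: keep the candidates that the regex matches in full."""
    return [word for word in candidates if regex.fullmatch(word)]


ask_words = ["ask", "asks", "asked", "asking", "asker", "task"]
print(forms(ASK, ask_words))
print(forms(ASK, ask_words) == forms(ASK_LONG, ask_words))

print(forms(SAY, ["say", "says", "saying", "said", "sayed", "sad"]))

print(forms(ISE, ["realise", "realized", "organising", "organizes",
                  "rise", "realisation"]))
```

Note that the last pattern describes word shapes, not verbs: it accepts rise just as readily as realise, so a query built on it may need an additional part-of-speech constraint.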
- Backslash (\) “escapes” special characters, i.e. forces them to match literally
\? → ?
\(\) → ()
\.{3} → ...
\$\. → $.
^ and $ must also be escaped (\^ and \$), even though the ^ and $ anchors are not useful in CQP
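The effect of escaping can be illustrated with a short Python sketch. As before, the re module only approximates CQP's regex library, matches is a hypothetical helper, and whole-word matching is assumed. The call to re.escape is a Python convenience with no counterpart in CQP itself.

```python
import re


def matches(pattern, word):
    """Hypothetical helper: True if the pattern matches the whole word."""
    return re.fullmatch(pattern, word) is not None


print(matches(r"\?", "?"))        # True: escaped ? is a literal question mark
print(matches(r"\(\)", "()"))     # True: escaped parentheses do not form a group
print(matches(r"\.{3}", "..."))   # True: three literal full stops
print(matches(r"\.{3}", "abc"))   # False: escaped . no longer matches any character
print(matches(r".{3}", "abc"))    # True: unescaped . is the matchall
print(matches(r"\$\.", "$."))     # True
print(matches(r"\^", "^"))        # True

# Python can generate the escaped form of an arbitrary literal string.
print(re.escape("$."))
```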