- 1
- Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora
- 2
- Recall that only the nesting of a <np> region within a
larger <np> region constitutes recursion in the CWB data model.
The nesting of <pp> within <np> (and vice versa) is
unproblematic, since these regions are encoded in two independent
s-attributes (named pp and np).
- 3
- The -e mode is not enabled by default for reasons of
backward compatibility. When command-line editing is active, multi-line
commands are not allowed, even when the input is read from a pipe.
- 4
- You can alternatively use 7-zip to handle these file formats
by setting the environment variable CWB_USE_7Z (CWB v3.4.35+).
7-zip (from https://www.7-zip.org/) is somewhat easier to install
than gzip and bzip2, on Windows in particular,
but note that you still need to make
sure that the 7z program can be found on your environment's PATH.
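
The PATH requirement can be checked before launching any CWB tool. The following Python sketch is only an illustration: the helper name `env_for_7z` is hypothetical, and the value `"1"` assigned to CWB_USE_7Z is an assumption, since the note above only says that the variable has to be set.

```python
import os
import shutil


def env_for_7z(base_env=None, program="7z"):
    """Return a copy of the environment with CWB_USE_7Z set.

    Raises FileNotFoundError if `program` cannot be found on PATH.
    The value "1" is an assumption; the text only requires the variable to be set.
    """
    env = dict(os.environ if base_env is None else base_env)
    # shutil.which searches the given PATH (or the process PATH if none is given)
    if shutil.which(program, path=env.get("PATH")) is None:
        raise FileNotFoundError(f"{program} not found on PATH")
    env["CWB_USE_7Z"] = "1"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting a CWB program from a script.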
- 5
- But see the notes on pipes in Sec. 3.3.
- 6
- Earliest here refers to corpus position,
not to the position of the token pattern in the query string.
- 7
- External sorting may also allow language-specific sort order
(collation) if supported by the system's sort command.
To achieve this on Unix, set the LC_COLLATE or LC_ALL
environment variable to an appropriate locale before running CQP. You
should not use the %c and %d flags in this case.
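
When CQP is started from a script, the locale can be set in the child environment rather than globally. The sketch below is a minimal illustration in Python; the helper name `collation_env` is hypothetical, and any locale name passed to it (e.g. `de_DE.UTF-8`) must actually be installed on the system. It removes LC_ALL because that variable takes precedence over LC_COLLATE.

```python
import os


def collation_env(locale_name, base_env=None):
    """Return a copy of the environment that selects `locale_name` for collation."""
    env = dict(os.environ if base_env is None else base_env)
    # LC_ALL overrides LC_COLLATE, so drop it to let LC_COLLATE take effect
    env.pop("LC_ALL", None)
    env["LC_COLLATE"] = locale_name
    return env
```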
- 8
- Windows Subsystem for Linux; for purposes of running CWB, WSL is
just another Unix.
- 9
- The keyword and target anchors are set to
undefined (-1) when no match is found for the search pattern,
while the match and matchend anchors retain their
previous values. In this way, a set match or set
matchend command may only modify some of the matches in a named query
result.
- 10
- Keep in mind that you have to type
cat NPobj; in order to display the result, because the implicit
NQR Last is just a copy that will not have been modified.
- 11
- This issue arises if the match or
matchend anchor is modified: What should the algorithm do if the
source anchor is undefined? What if the update would create an invalid
matching range with matchend < match? Up to v3.4.30,
CQP would leave the match unmodified in the first case, but shorten it to
a single token in the second case.
- 12
- If the destination anchor is newly
created by the command, it is initialised to undefined values.
- 13
- If target is to the left of the match, S1 extends the
start of the match to target, then S2 sets matchend to
the same position. If target is to the right of the match, S1
is a no-op (because it would cross over match with matchend).
S2 then extends the end of the match to target, and S3 can set
match to the same position. In either case, S3 deletes matches
for which target is undefined. Display the query using
cat Elephants; after each step for an illustration of this process.
- 14
- These rules are designed to
produce the effect described above, i.e. optional elements at the start
of a query are included, but those at the end are excluded. Note that
the standard matching strategy does allow nested matches provided they are
properly nested, i.e. have neither the same start point nor the
same end point.
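
The condition for proper nesting can be written as a small predicate. This Python sketch assumes that each match is given as a pair (match, matchend) of inclusive corpus positions; the function name `properly_nested` is hypothetical.

```python
def properly_nested(inner, outer):
    """True if `inner` lies inside `outer` and shares neither endpoint with it.

    Both ranges are (match, matchend) pairs of inclusive corpus positions.
    """
    (i_start, i_end), (o_start, o_end) = inner, outer
    # strict inequalities exclude a shared start point and a shared end point
    return o_start < i_start and i_end < o_end
```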
- 15
- Overlapping matches may result from the traditional
matching strategy, set operations, or modification of the matching word
sequences with expand, set match, or set
matchend. When a named query with overlapping matches is activated, a
warning message is issued and some of the matches will be automatically
deleted.
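
To find out in advance whether a dumped query result would trigger this warning, overlaps can be detected outside CQP. The Python sketch below assumes that matches are (match, matchend) pairs of inclusive corpus positions and treats any shared corpus position as an overlap; it does not attempt to reproduce which matches CQP deletes. The function name `overlapping_pairs` is hypothetical.

```python
def overlapping_pairs(matches):
    """Return sorted index pairs (i, j), i < j, of matches that share a token."""
    order = sorted(range(len(matches)), key=lambda k: matches[k])
    pairs = []
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if matches[j][0] > matches[i][1]:
                break  # sorted by start: no later range can overlap match i
            pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```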
- 16
- See
https://www.pcre.org/original/doc/html/pcrepattern.html#SEC5
- 17
- Using CR or LF is probably a very bad idea,
as is using the same string for both attribute and token separators.
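
A script that generates separator settings could guard against both problems. The Python sketch below is an illustration only: the function name `check_separators` is hypothetical, and it assumes that the CR/LF caveat applies to both kinds of separator.

```python
def check_separators(attribute_sep, token_sep):
    """Return a list of warnings for risky separator choices (empty if none)."""
    warnings = []
    for name, sep in (("attribute", attribute_sep), ("token", token_sep)):
        if "\r" in sep or "\n" in sep:
            warnings.append(f"{name} separator contains CR or LF")
    if attribute_sep == token_sep:
        warnings.append("attribute and token separators are identical")
    return warnings
```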
- 18
- Since this command dumps the matches of a named query in their
current sort order, the natural order should normally first be restored by calling
sort without a by clause. One exception is if the dump is
to be used for a KWIC display of the query results in their sorted order.
- 19
- Of course, it is also possible to establish an indirect link
through document IDs, annotated as <dialogue id=XXXX> ..
</dialogue>. If the corpus contains a very large number of dialogues,
the direct link approach is usually much more efficient, though.
- 20
- The second argument is necessitated by technical limitations of built-in functions.
To locate the start of the sentence containing the current token, use the
“this” label _, as in lbound_of(s, _).
- 21
- Namely, cats followed by an arbitrary token, followed by a gap of up to two tokens, followed by dogs.
Entering this command will print an error message because matchall patterns are not allowed in TAB queries.
- 22
- Note that greedy is used in a different sense here than
the “greedy matching” of regular expression quantifiers.
- 23
- In earlier versions of CQP, anchored subqueries can be used
by activating Result as a subcorpus and anchoring the additional
queries with <match> in initial position. This is less convenient
and can lead to issues if there are overlapping matches
(which are not allowed for subcorpora).
- 24
- Cautious programmers might want to verify that the matching ranges
of each dump Temp; are identical before discarding the first two
columns of the dump. Alternatively,
tabulate Temp target, keyword; can be used from the second
iteration in order to avoid redundant information. It would seem to be more
efficient to obtain the matching ranges and first two anchors from
dump Result; and save one iteration. However, the result sets
might not be identical if Result contains multiple matches starting
at the same corpus position (typically from top-level alternatives in the
query), which is not possible for the anchored queries. It is safe to use
the faster solution if matching strategy is set to shortest
or longest.
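
The consistency check suggested for cautious programmers might look as follows in Python. This is a sketch under the assumption that each dump consists of four tab-separated integer columns (match, matchend, target, keyword), with -1 for undefined anchors; the function names `parse_dump` and `merge_anchor_columns` are hypothetical.

```python
def parse_dump(text):
    """Parse dump output into a list of (match, matchend, target, keyword) tuples."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            match, matchend, target, keyword = (int(f) for f in line.split("\t"))
            rows.append((match, matchend, target, keyword))
    return rows


def merge_anchor_columns(dumps):
    """Combine several dumps of the same matches, keeping only their anchor columns.

    Raises ValueError if the matching ranges (first two columns) differ.
    """
    first = [row[:2] for row in dumps[0]]
    for n, dump in enumerate(dumps[1:], start=2):
        if [row[:2] for row in dump] != first:
            raise ValueError(f"matching ranges of dump {n} differ from dump 1")
    # one output row per match: its range, then target/keyword from each dump
    return [rng + tuple(a for dump in dumps for a in dump[k][2:])
            for k, rng in enumerate(first)]
```

Each output row holds the matching range once, followed by the target and keyword anchors of every iteration in order.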
- 25
- The new feature is only available in standard “finite state”
queries, of course, because both MU and TAB queries are entirely
token-based and cannot integrate structural annotation.
- 26
- In contrast to the use in queries of actual XML tags,
there can be whitespace between the brackets and the element name,
e.g. << np >>.
- 27
- See the document Finite State Queries in CWB3,
available in the CWB subversion repository, for a discussion of these issues,
as well as a sketch of the implementation of ad-hoc annotation.
- 28
- depending on the matching strategy used,
and always for the union of query results
- 29
- For technical reasons, a similar approach involving XML tags
<NP> and </NP> was not feasible;
nor is it possible to select arbitrary spans
(e.g. target...matchend) from the query result.
- 30
- It is left as an exercise to the reader
to set this anchor in the query above, most conveniently
with @[::].
- 31
- This did not work correctly
before CQP v3.4.31, which also introduced support for offsets.
- 32
- More than 5 times faster on this author's laptop computer.