... DECIDE1
Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora
...2
Recall that only the nesting of a <np> region within a larger <np> region constitutes recursion in the CWB data model. The nesting of <pp> within <np> (and vice versa) is unproblematic, since these regions are encoded in two independent s-attributes (named pp and np).
... features3
The -e mode is not enabled by default for reasons of backward compatibility. When command-line editing is active, multi-line commands are not allowed, even when the input is read from a pipe.
...4
Alternatively, you can use 7-zip to handle these file formats by setting the environment variable CWB_USE_7Z (CWB v3.4.35+). 7-zip (from https://www.7-zip.org/) is somewhat easier to install than gzip and bzip2, particularly on Windows, but note that you still need to make sure that the 7z program can be found on your environment's PATH.
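As a rough illustration of the PATH requirement, the following Python sketch prepares an environment for launching a CWB tool only if the 7z program can actually be found. The helper name seven_zip_env is hypothetical, and the value "1" is an assumption: the note above only says that the variable must be set, not which value it expects.

```python
import os
import shutil


def seven_zip_env(program="7z", base_env=None):
    """Return a copy of the environment with CWB_USE_7Z set.

    Returns None if `program` cannot be found on PATH, in which case
    setting the variable would not help.
    """
    env = dict(os.environ if base_env is None else base_env)
    # shutil.which searches the given PATH string for an executable.
    if shutil.which(program, path=env.get("PATH")) is None:
        return None
    # Assumption: any non-empty value enables 7-zip support.
    env["CWB_USE_7Z"] = "1"
    return env
```

The returned dictionary could then be passed as the env argument of subprocess.run when starting a CWB program from a script.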
... pipe5
But see the notes on pipes in Sec. 3.3.
... token6
Earliest here refers to corpus position, not to the position of the token pattern in the query string.
...7
External sorting may also allow language-specific sort order (collation) if supported by the system's sort command. To achieve this on Unix, set the LC_COLLATE or LC_ALL environment variable to an appropriate locale before running CQP. You should not use the %c and %d flags in this case.
... WSL8
Windows Subsystem for Linux; for purposes of running CWB, WSL is just another Unix.
... matches9
The keyword and target anchors are set to undefined (-1) when no match is found for the search pattern, while the match and matchend anchors retain their previous values. In this way, a set match or set matchend command may only modify some of the matches in a named query result.
... NP10
Keep in mind that you have to type cat NPobj; in order to display the result, because the implicit NQR Last is just a copy that will not have been modified.
...11
This issue arises if the match or matchend anchor is modified: What should the algorithm do if the source anchor is undefined? What if the update would create an invalid matching range with matchend < match? Up to v3.4.30, CQP would leave the match unmodified in the first case, but shorten it to a single token in the second case.
... unmodified.12
If the destination anchor is newly created by the command, it is initialised to undefined values.
...13
If target is to the left of the match, S1 extends the start of the match to target, then S2 sets matchend to the same position. If target is to the right of the match, S1 is a no-op (because it would cross over match with matchend). S2 then extends the end of the match to target, and S3 can set match to the same position. In either case, S3 deletes matches for which target is undefined. Display the query using cat Elephants; after each step for an illustration of this process.
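The three steps can be traced with a small Python toy model. This is only a sketch of the behaviour described in this note, not CQP's implementation: each match is reduced to a (match, matchend, target) triple, the function set_anchor and its strict flag are hypothetical names, and it is an assumption that the final step leaves a crossing update untouched rather than deleting it.

```python
UNDEF = -1  # undefined anchors are represented as -1


def set_anchor(rows, dest, strict=False):
    """Copy target into `dest` ("match" or "matchend") for each row.

    Rows are (match, matchend, target) triples. An update that would
    cross match over matchend is skipped. Rows with an undefined target
    are kept unchanged, or dropped if `strict` is true.
    """
    result = []
    for match, matchend, target in rows:
        if target == UNDEF:
            if not strict:
                result.append((match, matchend, target))
            continue
        start, end = (target, matchend) if dest == "match" else (match, target)
        if start > end:
            # Invalid range: treat the update as a no-op.
            result.append((match, matchend, target))
        else:
            result.append((start, end, target))
    return result


rows = [(10, 12, 8), (10, 12, 15), (10, 12, UNDEF)]
s1 = set_anchor(rows, "match")              # S1
s2 = set_anchor(s1, "matchend")             # S2
s3 = set_anchor(s2, "match", strict=True)   # S3
```

In this model, the first row (target to the left) is already complete after S2, the second row (target to the right) only after S3, and the third row is removed by S3.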
... point;14
These rules are designed to produce the effect described above, i.e. optional elements at the start of a query are included, but those at the end are excluded. Note that standard does allow nested matches provided they are properly nested, i.e. have neither the same start point nor the same end point.
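The condition for proper nesting can be written down as a one-line check. The following Python sketch assumes that each match is given as a (start, end) pair of corpus positions; the function name is hypothetical.

```python
def properly_nested(inner, outer):
    """True if `inner` lies within `outer` and shares neither its
    start point nor its end point."""
    return outer[0] < inner[0] and inner[1] < outer[1]
```

For example, the range (3, 4) is properly nested in (2, 6), whereas (2, 4) and (3, 6) are not, because each shares one boundary with the outer range.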
... non-overlapping15
Overlapping matches may result from the traditional matching strategy, set operations, or modification of the matching word sequences with expand, set match, or set matchend. When a named query with overlapping matches is activated, a warning message is issued and some of the matches will be automatically deleted.
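To find out in advance whether a dumped query result would trigger this warning, a check along the following lines may be useful. It is a sketch that assumes the matches are available as (match, matchend) pairs with inclusive end positions; it only detects overlaps and makes no claim about which matches CQP would delete.

```python
def has_overlaps(matches):
    """True if any two (match, matchend) ranges share a corpus position."""
    ordered = sorted(matches)
    # After sorting by start position, any overlapping pair implies
    # that some pair of neighbours overlaps as well.
    return any(nxt[0] <= cur[1] for cur, nxt in zip(ordered, ordered[1:]))
```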
... \pP16
See https://www.pcre.org/original/doc/html/pcrepattern.html#SEC5
...17
CR or LF is probably a very bad idea, as is using the same string for both attribute and token separators.
...18
Since this command dumps the matches of a named query in their current sort order, the natural order should normally first be restored by calling sort without a by clause. One exception is if the dump is to be used for a KWIC display of the query results in their sorted order.
...19
Of course, it is also possible to establish an indirect link through document IDs, annotated as <dialogue id=XXXX> .. </dialogue>. If the corpus contains a very large number of dialogues, the direct link approach is usually much more efficient, though.
...dist();20
The second argument is necessitated by technical limitations of built-in functions. To locate the start of a sentence containing the current token, use the "this" label: lbound_of(s, _).
...21
Namely, cats followed by an arbitrary token, followed by a gap of up to two tokens, followed by dogs. Entering this command will print an error message because matchall patterns are not allowed in TAB queries.
...22
Note that greedy is used in a different sense here than the “greedy matching” of regular expression quantifiers.
...23
In earlier versions of CQP, anchored subqueries can be carried out by activating Result as a subcorpus and anchoring the additional queries with <match> in initial position. This is less convenient and can lead to issues if there are overlapping matches (which are not allowed for subcorpora).
...24
Cautious programmers might want to verify that the matching ranges of each dump Temp; are identical before discarding the first two columns of the dump. Alternatively, tabulate Temp target, keyword; can be used from the second iteration in order to avoid redundant information. It would seem to be more efficient to obtain the matching ranges and first two anchors from dump Result; and save one iteration. However, the result sets might not be identical if Result contains multiple matches starting at the same corpus position (typically from top-level alternatives in the query), which is not possible for the anchored queries. It is safe to use the faster solution if matching strategy is set to shortest or longest.
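The verification step could look roughly like the following Python sketch. It assumes that each dump has been captured as text with four TAB-separated integer columns per line (match, matchend, target, keyword); the function names are hypothetical.

```python
def parse_dump(text):
    """Parse dump output into a list of integer tuples, one per match."""
    return [
        tuple(int(field) for field in line.split("\t"))
        for line in text.splitlines()
        if line.strip()
    ]


def merge_dumps(first, second):
    """Append the anchor columns of `second` to the rows of `first`.

    Raises ValueError unless both dumps list exactly the same
    matching ranges (first two columns) in the same order.
    """
    if len(first) != len(second):
        raise ValueError("dumps contain different numbers of matches")
    merged = []
    for row_a, row_b in zip(first, second):
        if row_a[:2] != row_b[:2]:
            raise ValueError(f"matching ranges differ: {row_a[:2]} vs {row_b[:2]}")
        # Keep the range once, followed by all anchor columns.
        merged.append(row_a + row_b[2:])
    return merged
```

Raising an error instead of silently skipping mismatched rows makes the problem case described above (multiple matches starting at the same corpus position) visible immediately.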

...25
The new feature is only available in standard “finite state” queries, of course, because both MU and TAB queries are entirely token-based and cannot integrate structural annotation.
... element.26
In contrast to the use in queries of actual XML tags, there can be whitespace between the brackets and the element name, e.g. << np >>.
... queries.27
See the document Finite State Queries in CWB3, available in the CWB subversion repository, for a discussion of these issues, as well as a sketch of the implementation of ad-hoc annotation.
...28
depending on the matching strategy used, and always for the union of query results
...29
For technical reasons, a similar approach involving XML tags <NP> and </NP> was not feasible; nor is it possible to select arbitrary spans (e.g. target...matchend) from the query result.
...NP,30
It is left as an exercise to the reader to set this anchor in the query above, most conveniently with @[::].
... span).31
This did not work correctly before CQP v3.4.31, which also introduced support for offsets.
... up32
More than 5 times faster on this author's laptop computer.