- An important aspect of interfacing CQP with other software is exchanging
lists of the corpus positions of query matches (as well as
target and keyword anchors) in both directions. This is a prerequisite for
extracting further information about the matches by direct corpus access,
and it is the most efficient way of relating query matches to externally
managed data structures (e.g. metadata held in a SQL database or spreadsheet
application).
- The dump command (Section 3.3) prints the
required information in a tabular ASCII format that can easily be parsed by
other tools or read into a SQL database. Each row of the resulting table
corresponds to one match of the query, and
the four columns give the corpus positions of the match,
matchend, target and keyword anchors,
respectively. The example below is reproduced from
Section 3.3:
1019887 1019888 -1 -1
1924977 1924979 1924978 -1
1986623 1986624 -1 -1
2086708 2086710 2086709 -1
2087618 2087619 -1 -1
2122565 2122566 -1 -1
Undefined target anchors are represented by -1 in the
third column. Even though no keyword anchors were set for the query
in question, the fourth column is included in the dump table,
but with all values set to -1.
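As an illustration of how an external tool might read this table, here is a minimal Python sketch. The names `Match` and `parse_dump` are hypothetical and not part of CQP or CWB. The sketch assumes whitespace-separated columns and maps the placeholder value -1 to `None` for the two optional anchors.

```python
from typing import List, NamedTuple, Optional


class Match(NamedTuple):
    """One row of a CQP dump table (hypothetical helper type)."""
    match: int
    matchend: int
    target: Optional[int]   # None if CQP printed -1 (anchor undefined)
    keyword: Optional[int]  # None if CQP printed -1 (anchor undefined)


def parse_dump(text: str) -> List[Match]:
    """Parse the four-column output of the dump command."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        match, matchend, target, keyword = (int(f) for f in line.split())
        rows.append(Match(match, matchend,
                          None if target == -1 else target,
                          None if keyword == -1 else keyword))
    return rows


# First two rows of the example table above
sample = "1019887\t1019888\t-1\t-1\n1924977\t1924979\t1924978\t-1\n"
matches = parse_dump(sample)
```

A caller can then test `m.target is None` instead of comparing against -1.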
- The table created by the dump command is printed to standard output
(stdout) by default, where it can be captured by a program running
CQP as a backend (e.g. the CWB/Perl interface, cf. Sec. 7.1). The dump table can also be redirected
to a file:
> dump A > "dump.tbl";
The file is automatically compressed if the filename ends in .gz or .bz2
(new in CQP v3.4.11; see Sec. 3.1).
- Alternatively, the output can be redirected to a pipe, e.g. to
create a dump file without the superfluous keyword column:
> dump A > "| cut -f 1-3 > dump.tbl";
(Windows users should refer to the caveats on use of pipes
in Sec. 3.3.)
- In versions of CQP prior to v3.4.11, a pipe was needed to compress a dump
file on the fly:
> dump A > "| gzip > dump.tbl.gz";
- Sometimes it is desirable to reload a dump file into CQP after it has
been modified by an external program (e.g. a database may have filtered the
matches against a metadata table). The undump command creates a new
named query result (B in the example below) for the currently
activated corpus from a dump file (which may be a compressed file in CQP
v3.4.11 and newer):
> undump B < "mydump.tbl";
Undumping data to an NQR (here, B) overwrites that NQR without warning
if it already exists.
- The format of files to be undumped, here mydump.tbl, is almost
identical to the output of dump, but by default it contains only two
columns, giving the match and matchend positions.
The example below shows a valid dump file for the DICKENS corpus,
which can be read with undump to create a query result containing 5
matches:
20681 20687
379735 379741
1915978 1915983
2591586 2591591
2591593 2591598
Save these lines to a text file named dickens.tbl, then enter the
following commands:
> DICKENS;
> undump Twas < "dickens.tbl";
> cat Twas;
- Further columns for the target and keyword anchors (in
that order) can optionally be added. In this case, you must append the
modifier with target or with target keyword to the
undump command:
> undump B with target keyword < "mydump.tbl";
- Dump files can also be read from a pipe or from standard input.
In the latter case the table of corpus positions has to be preceded by a
header line that specifies the total number of matches:
5
20681 20687
379735 379741
1915978 1915983
2591586 2591591
2591593 2591598
CQP uses this information to pre-allocate internal storage for the query
result, as well as to validate the file format. This format can also be
used as a more efficient alternative if the dump is read from a regular
file. CQP automatically detects which of the two formats is used.
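A program that generates undump input could emit the header format as in the following sketch. The function name `write_undump` is hypothetical, and TAB is assumed as the column separator. Because the header must state the total number of rows, the rows are collected before anything is written.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_undump(rows: Iterable[Tuple[int, ...]], out: TextIO) -> int:
    """Write an undump table preceded by the row-count header line."""
    rows = list(rows)            # the count must be known before writing
    out.write(f"{len(rows)}\n")  # header line: total number of matches
    for row in rows:
        out.write("\t".join(str(cpos) for cpos in row) + "\n")
    return len(rows)


buf = io.StringIO()
write_undump([(20681, 20687), (379735, 379741)], buf)
table = buf.getvalue()
```

Rows with three or four values would additionally require the with target or with target keyword modifier on the undump command.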
- Pipes can be used e.g. to read a dump table generated by another
program. They are indicated by a pipe symbol (|) at the start of the
filename (new in CQP v3.4.11) or at the end of the filename (earlier
versions); see further the notes in Sec. 3.3.
- Before CQP v3.4.11, pipes were also needed to read a dump table
from a compressed file (with the pipe symbol at the end of the filename,
as required by these earlier versions):
> undump B < "gzip -cd mydump.tbl.gz |";
- In an interactive CQP session, the input file can be omitted and the
undump table can then be entered directly on the command line. This feature
works best if command-line editing support is enabled with the -e
switch.
- Since the dump table is read from standard input here, only the second
format is allowed, i.e. you have to enter the total number of matches
first. Try entering the example table above after typing
> undump B;
- Without the -e switch, the standard-input format is a little
counterintuitive. The initial undump command must be terminated by a
semicolon, which is followed directly by the header number, with
no space between the semicolon and the number. The remaining lines are
entered as usual.
> undump In-Non-E-Mode;2
1915978 1915983
2591586 2591591
- If the rows of the undump table are not sorted in their natural order
(i.e. by corpus position), they have to be re-ordered internally so that
CQP can work with them. However, the original sort order is recorded
automatically and will be used by the cat and dump
commands (until it is reset by a new sort command). If you sort a
query result A, save it with dump to a text file, and then
read this file back in as named query B, then A and
B will be sorted in exactly the same order.
- In many cases, overlapping or unsorted matches are not intentional but
rather errors in an automatically generated dump table. In order to catch
such errors, the additional keyword ascending (or asc) can
be specified before the < character:
> undump B with target ascending < "mydump.tbl";
This command will abort with an error message (indicating the row number
where the error occurred) unless the corpus matches in mydump.tbl are
non-overlapping and sorted in corpus order.
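A generating program can run a comparable check before handing the table to CQP. The sketch below is only an approximation of the condition described here, not CQP's actual implementation: it assumes that each match must start after the previous match has ended and that no match may end before it starts. The name `first_bad_row` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def first_bad_row(rows: Sequence[Tuple[int, ...]]) -> Optional[int]:
    """Return the 1-based number of the first offending row, or None.

    Each row is (match, matchend, ...); any further anchor columns
    are ignored by this check.
    """
    prev_end = -1  # no corpus position precedes 0
    for number, (start, end, *_anchors) in enumerate(rows, start=1):
        if end < start or start <= prev_end:
            return number
        prev_end = end
    return None


ok = first_bad_row([(20681, 20687), (379735, 379741)])
overlapping = first_bad_row([(10, 20), (20, 25)])
unsorted_rows = first_bad_row([(379735, 379741), (20681, 20687)])
```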
- A typical use case for dump and undump is to link CQP
queries to corpus metadata stored in an external database. Assume that
a corpus consists of a large collection of transcribed dialogues, which are
marked as <dialogue> regions. Assume further that rich metadata (about the
speakers, setting, topic, etc.) is available in a SQL database. The
database entries can be linked directly to the <dialogue> regions
by recording their start and end corpus positions in the database.
The following commands generate a dump table with the required information,
which can easily be loaded into the database (ignoring the third and fourth
columns of the table):
> A = <dialogue> [] expand to dialogue;
> dump A > "dialogues.tbl";
Corpus queries will often be restricted to a subcorpus by specifying
constraints on the metadata. Once the metadata constraints have been
resolved in the SQL database, the selected entries can be translated to the
corresponding regions in the corpus (again represented by start and end
corpus positions). The positions are then sorted in ascending order and
saved to a TAB-delimited text file. Now they can be loaded into CQP with
the undump command, and the resulting query result can be activated as a
subcorpus for subsequent queries. It is recommended to specify the
ascending option in order to ensure that the loaded query result forms a
valid subcorpus:
> undump SubCorpus ascending < "subcorpus.tbl";
> SubCorpus;
Subcorpus[..]> A = ... ;
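The database side of this workflow might look like the following sketch, which uses Python's sqlite3 module with an in-memory database. The table name `dialogues`, its columns, the topic values and the corpus positions are all invented for illustration. A real setup would load the first two columns of dialogues.tbl into the database and write the result to subcorpus.tbl.

```python
import sqlite3
from typing import List, Tuple


def select_regions(conn: sqlite3.Connection, topic: str) -> List[Tuple[int, int]]:
    """Return (start, end) corpus positions of matching dialogues,
    sorted in ascending corpus order as needed for undump ... ascending."""
    cur = conn.execute(
        "SELECT cpos_start, cpos_end FROM dialogues "
        "WHERE topic = ? ORDER BY cpos_start",
        (topic,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialogues (cpos_start INTEGER, cpos_end INTEGER, topic TEXT)"
)
# Invented rows standing in for the first two columns of dialogues.tbl,
# joined with invented metadata
conn.executemany(
    "INSERT INTO dialogues VALUES (?, ?, ?)",
    [(5000, 7499, "travel"), (0, 4999, "weather"), (7500, 9999, "weather")],
)

regions = select_regions(conn, "weather")
# TAB-delimited table in the two-column undump format
table = "".join(f"{start}\t{end}\n" for start, end in regions)
```

Sorting in the SQL query keeps the table in corpus order, so the ascending check in undump should only fail if the stored regions themselves overlap.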