7.2 Exchanging corpus positions with external programs

An important aspect of interfacing CQP with other software is passing back and forth lists of corpus positions of query matches (as well as target and keyword anchors). This is a prerequisite for extracting further information about the matches by direct corpus access, and it is the most efficient way of relating query matches to externally managed data structures (e.g. metadata held in a SQL database or spreadsheet application).
The dump command (Section 3.3) prints the required information in a tabular ASCII format that can easily be parsed by other tools or read into a SQL database.¹⁸ Each row of the resulting table corresponds to one match of the query, and the four columns give the corpus positions of the match, matchend, target and keyword anchors, respectively. The example below is reproduced from Section 3.3:
```
        1019887 1019888 -1      -1
        1924977 1924979 1924978 -1
        1986623 1986624 -1      -1
        2086708 2086710 2086709 -1
        2087618 2087619 -1      -1
        2122565 2122566 -1      -1
```
Undefined target anchors are represented by -1 in the third column. Even though no keyword anchors were set for the query in question, the fourth column is included in the dump table, but with all values set to -1.
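As an illustration of how an external program might read this format, the following minimal Python sketch parses dump rows into records and maps -1 to None. The names Match and parse_dump are hypothetical helpers introduced here; they are not part of CQP or the CWB/Perl interface.

```python
from typing import Iterable, List, NamedTuple, Optional


class Match(NamedTuple):
    """One row of a dump table; undefined anchors are stored as None."""
    match: int
    matchend: int
    target: Optional[int]
    keyword: Optional[int]


def parse_dump(lines: Iterable[str]) -> List[Match]:
    """Parse dump rows consisting of four whitespace-separated integers."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        match, matchend, target, keyword = (int(f) for f in fields)
        rows.append(Match(
            match,
            matchend,
            None if target == -1 else target,    # -1 marks an undefined anchor
            None if keyword == -1 else keyword,
        ))
    return rows


# Two rows taken from the example table above.
sample = """\
1019887 1019888 -1      -1
1924977 1924979 1924978 -1
"""
matches = parse_dump(sample.splitlines())
print(matches[1].target)  # prints 1924978
```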
The table created by the dump command is printed to standard output (stdout) by default, where it can be captured by a program running CQP as a backend (e.g. the CWB/Perl interface, cf. Sec. 7.1). The dump table can also be redirected to a file:
> dump A > "dump.tbl";
which is automatically compressed if the filename ends in .gz or .bz2 (new in CQP v3.4.11; see Sec. 3.1).
Alternatively, the output can be redirected to a pipe, e.g. to create a dump file without the superfluous keyword column:
> dump A > "| cut -f 1-3 > dump.tbl";
(Windows users should refer to the caveats on use of pipes in Sec. 3.3.)
In versions of CQP prior to v3.4.11, a pipe was needed to compress a dump file on the fly:
> dump A > "| gzip > dump.tbl.gz";
Sometimes it is desirable to reload a dump file into CQP after it has been modified by an external program (e.g. a database may have filtered the matches against a metadata table). The undump command creates a new named query result (B in the example below) for the currently activated corpus from a dump file (which may be a compressed file in CQP v3.4.11 and newer):
> undump B < "mydump.tbl";
Undumping data to an NQR (here, B) overwrites that NQR if it already exists, silently and without warning.
The format of files to be undumped (here, mydump.tbl) is almost identical to the output of dump, but by default it contains only two columns, giving the match and matchend positions. The example below shows a valid dump file for the DICKENS corpus, which can be read with undump to create a query result containing 5 matches:
```
        20681   20687  
        379735  379741 
        1915978 1915983
        2591586 2591591
        2591593 2591598
```
Save these lines to a text file named dickens.tbl, then enter the following commands:
> DICKENS;
> undump Twas < "dickens.tbl";
> cat Twas;
Further columns for the target and keyword anchors (in that order) can optionally be added. In this case, you must append the modifier with target or with target keyword to the undump command:
> undump B with target keyword < "mydump.tbl";
Dump files can also be read from a pipe or from standard input. In the latter case the table of corpus positions has to be preceded by a header line that specifies the total number of matches:
```
        5
        20681   20687  
        379735  379741 
        1915978 1915983
        2591586 2591591
        2591593 2591598
```
CQP uses this information to pre-allocate internal storage for the query result, as well as to validate the file format. This format can also be used as a more efficient alternative if the dump is read from a regular file. CQP automatically detects which of the two formats is used.
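A program that generates undump input could produce the header format along the following lines. This is only a sketch: write_undump is a hypothetical helper, and it assumes the two-column default layout (match and matchend) with TAB-separated fields.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_undump(rows: Iterable[Tuple[int, int]], out: TextIO) -> None:
    """Write (match, matchend) pairs preceded by a header line with the row count."""
    rows = list(rows)            # materialise so the rows can be counted first
    out.write(f"{len(rows)}\n")  # header line: total number of matches
    for start, end in rows:
        out.write(f"{start}\t{end}\n")


# The first two matches of the DICKENS example, written to an in-memory buffer.
buf = io.StringIO()
write_undump([(20681, 20687), (379735, 379741)], buf)
print(buf.getvalue(), end="")
```

Writing to any file-like object keeps the helper usable for regular files as well as for a pipe connected to a CQP process.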
Pipes can be used e.g. to read a dump table generated by another program. They are indicated by a pipe symbol (|) at the start of the filename (new in CQP v3.4.11) or at the end of the filename (earlier versions); see further the notes in Sec. 3.3.
Before CQP v3.4.11, pipes were also needed to read a dump table from a compressed file:
> undump B < "| gzip -cd mydump.tbl.gz";
In an interactive CQP session, the input file can be omitted and the undump table can then be entered directly on the command line. This feature works best if command-line editing support is enabled with the -e switch.
Since the dump table is read from standard input here, only the second format is allowed, i.e. you have to enter the total number of matches first. Try entering the example table above after typing
> undump B;
Without the -e switch, the standard-input format is a little counterintuitive. The initial undump command must be terminated by a semicolon, which is followed directly by the header number, with no space between the semicolon and the number. The remaining lines are entered as usual.
```
        > undump In-Non-E-Mode;2
        1915978 1915983
        2591586 2591591
```
If the rows of the undump table are not sorted in their natural order (i.e. by corpus position), they have to be re-ordered internally so that CQP can work with them. However, the original sort order is recorded automatically and will be used by the cat and dump commands (until it is reset by a new sort command). If you sort a query result A, save it with dump to a text file, and then read this file back in as named query B, then A and B will be sorted in exactly the same order.
In many cases, overlapping or unsorted matches are not intentional but rather errors in an automatically generated dump table. In order to catch such errors, the additional keyword ascending (or asc) can be specified before the < character:
> undump B with target ascending < "mydump.tbl";
This command will abort with an error message (indicating the row number where the error occurred) unless the corpus matches in mydump.tbl are non-overlapping and sorted in corpus order.
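An external program can run a similar check before handing the table to CQP. The sketch below is not CQP's own implementation; first_bad_row is a hypothetical helper, and it assumes that "non-overlapping and sorted" means every match starts after the previous match has ended and that no match ends before it starts.

```python
from typing import Iterable, Optional, Tuple


def first_bad_row(rows: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the 1-based number of the first offending row, or None if all rows pass."""
    prev_end = -1  # corpus positions start at 0, so any valid start exceeds this
    for number, (start, end) in enumerate(rows, start=1):
        if end < start or start <= prev_end:
            return number  # malformed range, overlap, or out of corpus order
        prev_end = end
    return None


print(first_bad_row([(20681, 20687), (379735, 379741)]))  # prints None
print(first_bad_row([(100, 120), (110, 130)]))            # prints 2
```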
A typical use case for dump and undump is to link CQP queries to corpus metadata stored in an external database. Assume that a corpus consists of a large collection of transcribed dialogues, which are marked as <dialogue> regions. Assume further that rich metadata (about the speakers, setting, topic, etc.) is available in a SQL database. The database entries can be linked directly to the <dialogue> regions by recording their start and end corpus positions in the database.¹⁹ The following commands generate a dump table with the required information, which can easily be loaded into the database (ignoring the third and fourth columns of the table):
> A = <dialogue> [] expand to dialogue;
> dump A > "dialogues.tbl";
Corpus queries will often be restricted to a subcorpus by specifying constraints on the metadata. Once the metadata constraints have been resolved in the SQL database, the matching entries can be translated to the corresponding regions in the corpus (again represented by start and end corpus positions). The positions are then sorted in ascending order and saved to a TAB-delimited text file. Now they can be loaded into CQP with the undump command, and the resulting query result can be activated as a subcorpus for subsequent queries. It is recommended to specify the ascending option in order to ensure that the loaded query result forms a valid subcorpus:
> undump SubCorpus ascending < "subcorpus.tbl";
> SubCorpus;
Subcorpus[..]> A = ... ;
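The database side of this workflow might look roughly like the following Python sketch, which uses an in-memory SQLite database as a stand-in for the metadata store. The table name dialogues, its columns (start_cpos, end_cpos, topic) and the sample rows are all hypothetical; a real setup would load the first two columns of dialogues.tbl instead.

```python
import sqlite3
from typing import List, Tuple


def subcorpus_rows(conn: sqlite3.Connection, topic: str) -> List[Tuple[int, int]]:
    """Select the regions matching a metadata constraint, sorted by corpus position."""
    cursor = conn.execute(
        "SELECT start_cpos, end_cpos FROM dialogues "
        "WHERE topic = ? ORDER BY start_cpos",
        (topic,),
    )
    return cursor.fetchall()


# Hypothetical metadata table: one row per <dialogue> region.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialogues (start_cpos INTEGER, end_cpos INTEGER, topic TEXT)"
)
conn.executemany(
    "INSERT INTO dialogues VALUES (?, ?, ?)",
    [(500, 899, "travel"), (0, 499, "weather"), (900, 1200, "weather")],
)

rows = subcorpus_rows(conn, "weather")
# TAB-delimited two-column table, as expected by undump in its default setting.
table = "".join(f"{start}\t{end}\n" for start, end in rows)
print(table, end="")
```

Sorting in the SQL query itself means the resulting file should already satisfy the ascending check, provided the stored regions do not overlap.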