- A named query result can be “translated” to an aligned corpus, which
allows more flexible display of the aligned regions, access to metadata,
etc. (new in CQP v3.4.9).
- Consider the following example:
> EUROPARL-DE;
> set Context 1 s;
> Zeit = [lemma = "Zeit"];
- The NQR Zeit now contains all occurrences of the German word
for time in the German part of EuroParl. The following command
“translates” the NQR to the English part of EuroParl, i.e. it replaces
each match by the complete aligned region in the target corpus (as would be
displayed with show +europarl-en;).
> Time = from Zeit to EUROPARL-EN;
- This creates a new NQR EUROPARL-EN:Time containing the aligned
regions. You can now e.g. tabulate or count metadata:
> tabulate EUROPARL-EN:Time match text_date;
> group EUROPARL-EN:Time match text_date;
- The somewhat arcane syntax of the command avoids the introduction of a new reserved keyword.
- while it looks similar to a corpus query or set operation, the
assignment to a new NQR is mandatory (otherwise the parser won't accept
the syntax)
- note that the new NQR must be specified as a short name; the name of
the target corpus is implied and added automatically with the assignment
- Some important details:
- matching ranges that are not aligned to the target corpus are silently
discarded; you cannot expect the new NQR to contain the same number of
hits as the original NQR
- if there are multiple matches in the same alignment bead, they will
not be collapsed in the target corpus; i.e. the new NQR will
contain several identical ranges
- in order to collate source matches with the aligned regions, make sure
to discard unaligned hits from the original NQR first:
> Zeit = [lemma = "Zeit"] :EUROPARL-EN [];
or post hoc, as a subquery filter:
> Zeit;
> ZeitAligned = <match> [] :EUROPARL-EN [] !;
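The details listed above can be illustrated with a small Python model. This is only a sketch of the described behaviour, not CQP's actual implementation: the bead table, the corpus positions and the helper names (`BEADS`, `find_bead`, `translate`) are invented for illustration, and the lookup by match start is a simplifying assumption.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (start, end) corpus positions

# Hypothetical alignment beads: (source region, target region).
BEADS: List[Tuple[Range, Range]] = [
    ((0, 9), (0, 11)),
    ((10, 19), (12, 20)),
    # source positions 20-29 are deliberately left unaligned
    ((30, 39), (21, 33)),
]


def find_bead(pos: int) -> Optional[Range]:
    """Return the target region of the bead containing pos, or None."""
    for (src_start, src_end), target in BEADS:
        if src_start <= pos <= src_end:
            return target
    return None


def translate(matches: List[Range]) -> List[Range]:
    """Replace each source match by its aligned target region."""
    result = []
    for start, _end in matches:
        # Assumption: the bead is looked up via the start of the match.
        target = find_bead(start)
        if target is not None:      # unaligned matches are dropped silently
            result.append(target)   # identical ranges are not collapsed
    return result


zeit = [(3, 3), (7, 7), (25, 25), (31, 31)]  # toy source matches
time = translate(zeit)
print(len(zeit), len(time))
print(time)
```

In this toy setup, four source matches yield three target ranges: the match at position 25 is dropped because it is unaligned, and the two matches inside the first bead produce the same target range twice.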
- Do not cat the translated query directly (cat
EUROPARL-EN:Time;) without first activating the target corpus, as this
would corrupt the context descriptor (see Sec. 3.1).
The correct procedure is:
> EUROPARL-EN;
> cat Time;
You can now customize the KWIC display as desired.
- It is safe, however, to apply dump, tabulate, group,
count and similar operations. Only commands that auto-print the
NQR (including a bare sort or a set operation) will trigger the bug.
- The problem is mentioned in this section because users are most likely
to be tempted to do this when working with a set of aligned corpora.
- As a second example, we will return to German translations of
nuclear power.
> EUROPARL-DE;
> Other = from EUROPARL-EN:Other to EUROPARL-DE;
- We can now run a subquery on the aligned regions in the German part of
EuroParl in order to search for possible translations other than Kern-
and Atom-. One possibility is that nuclear power plant has
been translated into the acronym AKW (for Atomkraftwerk).
> Other;
> [lemma = "AKW"];
- Further translation candidates can be found by computing a frequency
breakdown of all nouns in the aligned sentences:
> N = [pos = "N.*"];
> group N match word;
- We could have applied the same strategy to the NQR Nuke in
order to determine the frequencies of different translation equivalents:
> Nuke = from EUROPARL-EN:Nuke to EUROPARL-DE;
> Nuke;
> TEs = "(Atom|Kern|AKW).*";
> group TEs match lemma;
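To make the last two commands more concrete, the following Python sketch mimics what they compute: it selects word forms that match the regular expression as a whole, then counts the corresponding lemmas. The token list is invented toy data, not EuroParl output, and the names `tokens`, `pattern` and `counts` are illustrative only.

```python
import re
from collections import Counter

# Invented (word, lemma) pairs standing in for tokens in the aligned regions.
tokens = [
    ("Atomkraft", "Atomkraft"),
    ("Kernenergie", "Kernenergie"),
    ("AKWs", "AKW"),
    ("Kernkraftwerke", "Kernkraftwerk"),
    ("Energie", "Energie"),
    ("Atomkraft", "Atomkraft"),
]

pattern = re.compile(r"(Atom|Kern|AKW).*")

# fullmatch: the regular expression has to cover the entire word form,
# as CQP requires for regular expressions on token attributes.
hits = [lemma for word, lemma in tokens if pattern.fullmatch(word)]

# Frequency breakdown by lemma, analogous to "group TEs match lemma;".
counts = Counter(hits)
for lemma, freq in counts.most_common():
    print(f"{lemma}\t{freq}")
```

The word form Energie is excluded because the pattern must match from the beginning of the token; all other toy tokens are counted under their lemma.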