4.3 Structural attributes and XML

XML markup of NPs and PPs in the DICKENS corpus (cf. Appendix A.3)

<s len=9> 
  <np h="it" len=1> It </np> 
  is 
  <np h="story" len=6> the story 
    <pp h="of" len=4> of 
       <np h="man" len=3> an old man </np>
    </pp> 
  </np> 
  . 
</s>

key-value pairs within XML start tags are accessible in CQP as additional s-attributes with annotated values (marked [A] in the show cd; listing): s_len, np_h, np_len, pp_h, pp_len (cf. Section 1.2)
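
The naming scheme can be illustrated outside CQP with a minimal Python sketch. It is not part of CWB: the function name s_attributes_from_start_tag is hypothetical, and the only rule assumed is the one stated above, namely that each key in an element's start tag yields an s-attribute named element_key whose annotated value is the string given in the tag.

```python
import re

# Matches key="value" as well as the unquoted key=value form used in the example.
ATTR = re.compile(r'(\w+)=(?:"([^"]*)"|([^\s>]+))')

def s_attributes_from_start_tag(tag):
    """Map a start tag such as <np h="story" len=6> to {s-attribute name: value}."""
    element, _, rest = tag.strip("<>").partition(" ")
    return {
        f"{element}_{m.group(1)}": m.group(2) if m.group(2) is not None else m.group(3)
        for m in ATTR.finditer(rest)
    }

print(s_attributes_from_start_tag('<np h="story" len=6>'))
# {'np_h': 'story', 'np_len': '6'}
```

Note that the values stay strings (here '6'), which is why numerical comparisons need an explicit conversion.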
s-attribute values can be accessed through label references
> <np> a:[] []* </np> :: a.np_h = "bank";
$\to$ NPs with head lemma bank
an equivalent, but shorter version:
> /region[np,a] :: a.np_h="bank";
or use the match anchor label, which is automatically set to the first token of the match
> <np> []* </np> :: match.np_h="bank";
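
What a label reference such as a.np_h does can be sketched in Python: the label fixes a corpus position, and the annotation is read from the np_h region covering that position. This is only an illustration under stated assumptions. The Region type, the toy region list and both function names are hypothetical, the data is not taken from DICKENS, and plain string equality stands in for CQP's pattern matching.

```python
from typing import NamedTuple, Optional

class Region(NamedTuple):
    start: int   # corpus position of the first token
    end: int     # corpus position of the last token (inclusive)
    value: str   # annotated value, e.g. the head lemma for np_h

# Hypothetical np_h regions of a toy corpus.
NP_H = [Region(0, 0, "it"), Region(2, 4, "bank"), Region(7, 9, "river")]

def annotation_at(regions, cpos) -> Optional[str]:
    """Value of the region covering corpus position cpos, or None if there is none."""
    for r in regions:
        if r.start <= cpos <= r.end:
            return r.value
    return None

def nps_with_head(regions, lemma):
    """Spans of all regions whose annotation, read at the first token, equals lemma."""
    return [(r.start, r.end) for r in regions
            if annotation_at(regions, r.start) == lemma]

print(nps_with_head(NP_H, "bank"))  # [(2, 4)]
```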
constraints on key-value pairs can also directly be tested in start tags, using the appropriate auto-generated s-attribute (make sure to use a matching end tag)
> <np_h = "bank"> []* </np_h>;
comparison operators = and != are supported, together with the %c and %d flags;
= is the default and may be omitted
constraints on multiple key-value pairs require multiple start tags
> <np_h="bank"><np_len="[1-6]"> []* </np_len></np_h>;
(or access the value of np_len through a label reference)
<np> and <pp> tags are usually shown without XML attribute values;
they can be displayed explicitly as <np_h>, <np_len>, ... tags:
> show +np +np_h +np_len;
> cat;
(other corpora may show XML attributes in start tags)
use the special "this" label (written _, referring to the current token) for direct access to s-attribute values within a pattern
> [(pos="NNS?") & (lemma = _.np_h)];
(recall that a bare np_h would merely return an integer value indicating whether the current token is contained in a <np> region, not the desired annotation string)
typecast annotation values with int() for numerical comparison
> /region[np,a] :: int(a.np_len) > 30;
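
Why the conversion matters can be shown with a short Python sketch (Python rather than CQP, with made-up values). It assumes only that annotation values are stored as strings. It does not show how CQP itself treats an unconverted comparison.

```python
# Hypothetical np_len annotations, held as strings.
np_len_values = ["6", "31", "100", "4"]

# Comparing the raw strings orders them character by character.
as_strings = [v for v in np_len_values if v > "30"]

# Converting first gives the intended numerical comparison.
as_numbers = [v for v in np_len_values if int(v) > 30]

print(as_strings)  # ['6', '31', '4']
print(as_numbers)  # ['31', '100']
```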
NB: within a token pattern, s-attribute annotations can only be accessed with label references:
> [np_h="bank"];    (does not work!)
regions of structural attributes are non-recursive
$\Rightarrow$ embedded XML regions are renamed at time of indexing to <np1>, <np2>, ... <pp1>, <pp2>, ...
embedding level must be explicitly specified in the query:
> [pos="CC"] <np1> []* </np1>;
will only find NPs contained in exactly one larger NP
(use show +np +np1 +np2; to experiment)
regions representing the attributes in XML start tags are renamed as well:
$\Rightarrow$ <np_h1>, <np_h2>, ..., <pp_len1>, <pp_len2>, ...
> /region[np1, a] :: a.np_h1 = a.np_h within np;
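
The renaming can be reproduced on the example sentence from the start of this section with a minimal Python sketch. It is an illustration of the rule described above, not CWB's actual indexing code. The function index_regions and its data layout are hypothetical, and it assumes that the embedding level is counted separately for each element type (so a <pp> inside an <np> remains <pp>).

```python
import re

MARKUP = """
<s len=9>
  <np h="it" len=1> It </np>
  is
  <np h="story" len=6> the story
    <pp h="of" len=4> of
      <np h="man" len=3> an old man </np>
    </pp>
  </np>
  .
</s>
"""

ITEM = re.compile(r"<[^>]+>|[^\s<]+")                 # a tag or a token
ATTR = re.compile(r'(\w+)=(?:"([^"]*)"|([^\s>]+))')   # key="v" or key=v

def index_regions(markup):
    """Return (tokens, regions); each region is (s-attribute, start, end, value)."""
    tokens, regions = [], []
    open_tags = []   # stack of (element, level, start position, attrs)
    depth = {}       # element -> number of currently open regions of that element
    for item in ITEM.findall(markup):
        if item.startswith("</"):
            element, level, start, attrs = open_tags.pop()
            assert element == item[2:-1], "mismatched end tag"
            depth[element] -= 1
            suffix = str(level) if level else ""      # np, np1, np2, ...
            end = len(tokens) - 1
            regions.append((element + suffix, start, end, None))
            for key, value in attrs.items():          # np_h, np_h1, ...
                regions.append((f"{element}_{key}{suffix}", start, end, value))
        elif item.startswith("<"):
            element, _, rest = item[1:-1].partition(" ")
            attrs = {m.group(1): m.group(2) if m.group(2) is not None else m.group(3)
                     for m in ATTR.finditer(rest)}
            level = depth.get(element, 0)
            depth[element] = level + 1
            open_tags.append((element, level, len(tokens), attrs))
        else:
            tokens.append(item)
    return tokens, regions

tokens, regions = index_regions(MARKUP)
print([name for name, start, end, value in regions if value is None])
# ['np', 'np1', 'pp', 'np', 's']
```

In this sketch the inner NP "an old man" ends up as an np1 region spanning positions 5 to 7, with np_h1 = "man" and np_len1 = "3", while the enclosing NP keeps the plain names np, np_h and np_len.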
CQP queries typically use maximal NP and PP regions (e.g. to model clauses)
find any NP (regardless of embedding level):
> (<np>|<np1>|<np2>) []* (</np2>|</np1>|</np>);
CQP ensures that a matching pair of start and end tag is picked from the alternatives
observe how results depend on matching strategy (see Section 6.1 for details)
> set MatchingStrategy shortest;
> set MatchingStrategy longest;
> set MatchingStrategy standard;
(re-run the previous query after each set and watch out for “duplicate” matches)
when the query expression shown above is embedded in a longer query, the matching strategy usually has no influence
annotations of a region at an arbitrary embedding level can only be accessed through constraints on key-value pairs in the start tags:
> (<np_h "bank">|<np_h1 "bank">|<np_h2 "bank">) []* (</np_h2>|</np_h1>|</np_h>);