6.6 Feature set attributes (GERMAN-LAW)

feature set attributes use a special notation in which set members are separated and enclosed by | characters (both in the input files and in the indexed corpus itself)

they can be used to represent annotations that can be ambiguous and/or multi-valued

e.g. for an alemma (ambiguous lemma) attribute

ambiguity() function yields number of elements in set (its cardinality)

> [ambiguity(alemma) > 3];
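as a rough illustration of what ambiguity() counts, the following Python sketch models the set notation outside of CQP; the helper names parse_feature_set and ambiguity are hypothetical and not part of CWB, and the code only mirrors the |-delimited notation described in this section

```python
def parse_feature_set(value: str) -> list[str]:
    """Split a |-delimited feature set string into its members."""
    if not (value.startswith("|") and value.endswith("|")):
        raise ValueError(f"not in feature set notation: {value!r}")
    # "|" alone is the empty set; otherwise drop the outer delimiters.
    inner = value[1:-1]
    return inner.split("|") if inner else []


def ambiguity(value: str) -> int:
    """Model of ambiguity(): the cardinality of the set."""
    return len(parse_feature_set(value))


for raw in ["|Zeug|Zeuge|Zeugen|", "|Baum|", "|"]:
    print(raw, ambiguity(raw))
```

under this model, a token only satisfies ambiguity(alemma) > 3 if its alemma value lists at least four alternatives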

use contains operator to test for membership in the set

> [alemma contains "Zeuge"];
$\to$ words which can be lemmatised as Zeuge

equivalent to [alemma = ".*\|Zeuge\|.*"]
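the equivalence can be checked on a few values with a small Python model; this is a sketch under the assumption that the operand is matched against each member as a whole (as with ordinary CQP regular expressions), and the function names are hypothetical

```python
import re


def members(value: str) -> list[str]:
    """Split a |-delimited feature set string into its members."""
    inner = value.strip("|")
    return inner.split("|") if inner else []


def contains(value: str, pattern: str) -> bool:
    """True if at least one set member matches the whole pattern."""
    return any(re.fullmatch(pattern, m) for m in members(value))


def contains_via_regex(value: str, literal: str) -> bool:
    """The same test written as one regex over the raw annotation string."""
    return re.fullmatch(r".*\|" + re.escape(literal) + r"\|.*", value) is not None


for raw in ["|Zeug|Zeuge|Zeugen|", "|Zeuge|", "|Zeugen|", "|"]:
    print(raw, contains(raw, "Zeuge"), contains_via_regex(raw, "Zeuge"))
```

the two tests agree here because "Zeuge" is a literal string; a pattern with wildcards such as .* could match across | boundaries in the raw-string version, which is one reason to prefer contains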

test non-membership with not contains

(alemma not contains "Zeuge")
$\Longleftrightarrow$ !(alemma contains "Zeuge")

feature sets are also used to annotate phrases with sets of properties

> /region[np, a] :: a.np_f contains "quot";

see Appendix A.3 for lists of properties annotated in the GERMAN-LAW corpus

define macro for easy experimentation with property features

> define macro find('$0=Tag $1=Property')
'<$0_f contains "$1"> []* </$0_f>';

> /find[np, brac];
> /find[advp, temp];
etc.
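to see what such a macro call stands for, the substitution of $0 and $1 can be imitated in Python; this is only a sketch of plain textual replacement (the function expand_find is hypothetical), and CQP's actual macro processor does more than this

```python
def expand_find(tag: str, prop: str) -> str:
    """Imitate the expansion of /find[tag, prop] by textual substitution."""
    template = '<$0_f contains "$1"> []* </$0_f>'
    return template.replace("$0", tag).replace("$1", prop)


print(expand_find("np", "brac"))
print(expand_find("advp", "temp"))
```

the first call prints the query that /find[np, brac] is assumed to expand to, i.e. a region whose np_f set contains the property brac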

nominal agreement features of determiners, adjectives and nouns are stored in the agr attribute, using the pattern shown in Figure 7 (see Figure 8 for an example)

**Figure 7:** Annotation of noun agreement features in the GERMAN-LAW corpus.
[figure content lost in extraction; the surviving fragments show a colon-separated pattern beginning with "case:" and a value list ending in Ind, Nil]

**Figure 8:** An example of noun agreement features in the GERMAN-LAW corpus
[figure content lost in extraction; the surviving fragments show a two-column table that starts with the word der and ends with a feature set whose last member is Nom:M:Pl:Nil]

require all set members to match a regular expression

> [ (pos = "NN") & (agr matches ".*:Pl:.*") ];
$\to$ nouns which are uniquely identified as plurals
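the difference between matches (all members) and contains (at least one member) can be modelled in a few lines of Python; the agr values below are invented for illustration, the function names are hypothetical, and the sketch makes no claim about how CQP treats the empty set | under matches

```python
import re


def members(value: str) -> list[str]:
    """Split a |-delimited feature set string into its members."""
    inner = value.strip("|")
    return inner.split("|") if inner else []


def matches(value: str, pattern: str) -> bool:
    """True if every set member matches the whole pattern."""
    return all(re.fullmatch(pattern, m) for m in members(value))


def contains(value: str, pattern: str) -> bool:
    """True if at least one set member matches the whole pattern."""
    return any(re.fullmatch(pattern, m) for m in members(value))


# hypothetical agr values, not taken from the GERMAN-LAW corpus
plural_only = "|Akk:F:Pl:Nil|Nom:F:Pl:Nil|"
sg_or_pl = "|Nom:M:Pl:Nil|Nom:M:Sg:Nil|"

print(matches(plural_only, ".*:Pl:.*"))   # every analysis is plural
print(matches(sg_or_pl, ".*:Pl:.*"))      # one analysis is singular
print(contains(sg_or_pl, ".*:Pl:.*"))     # plural is still possible
```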

both contains and matches use regular expressions and accept the %c and %d flags

unification of agreement features $\Longleftrightarrow$ intersection of feature sets

use built-in /unify[] macro:

/unify[agr, <label1>, <label2>, ...]

undefined labels will automatically be ignored

> a:[pos="ART"] b:[pos="ADJA"]? c:[pos="NN"]
:: /unify[agr, a,b,c] matches "Gen:.*";
$\to$ (simple) NPs uniquely identified as genitive

> a:[pos="ART"] b:[pos="ADJA"]? c:[pos="NN"]
:: /unify[agr, a,b,c] contains "Dat:.:Sg:.*";

$\to$ NPs which might be dative singular

use ambiguity() function to find number of possible analyses

> ... :: ambiguity(/unify[agr, a,b,c]) >= 1;
$\to$ to check agreement within NP
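the unification step can be pictured as a plain set intersection; the following Python sketch assumes that an undefined label is simply skipped (modelled as None) and uses invented agr values, so it illustrates the idea rather than the CWB implementation; what CQP returns when every label is undefined is not modelled

```python
import re
from typing import Optional


def members(value: str) -> set[str]:
    """Split a |-delimited feature set string into a set of members."""
    inner = value.strip("|")
    return set(inner.split("|")) if inner else set()


def unify(*values: Optional[str]) -> set[str]:
    """Intersect the feature sets of all defined labels (None = undefined)."""
    defined = [members(v) for v in values if v is not None]
    return set.intersection(*defined) if defined else set()


# hypothetical agr values for a:[pos="ART"], b (not matched), c:[pos="NN"]
a = "|Dat:F:Sg:Def|Gen:F:Sg:Def|Nom:M:Sg:Def|"
b = None
c = "|Akk:F:Sg:Def|Dat:F:Sg:Def|Gen:F:Sg:Def|"

common = unify(a, b, c)
print(sorted(common))
print(len(common) >= 1)                                        # agreement check
print(any(re.fullmatch("Dat:.:Sg:.*", m) for m in common))     # contains
print(all(re.fullmatch("Gen:.*", m) for m in common))          # matches
```

in this example the NP agrees (two analyses remain), might be dative singular, but is not uniquely identified as genitive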

in the GERMAN-LAW corpus, NPs and other phrases are annotated with partially disambiguated agreement information; these feature sets can also be tested with the contains and matches operators, either indirectly through label references or directly in XML start tags

> /region[np, a] :: a.np_agr matches "Dat:.:Pl:.*";
> <np_agr matches "Dat:.:Pl:.*"> []* </np_agr>;

for computation speed, /unify[] expects feature sets in canonical format, with members sorted according to CWB's internal sort order; this is usually ensured with the -m option to cwb-s-encode

even if an attribute hasn't explicitly been defined as a feature set (and converted to canonical format), ambiguity(), contains and matches are guaranteed to work as long as the |-separated set notation is used correctly and consistently

however, the /unify[] macro cannot be used unless the members of each set are sorted in the canonical order; members are sorted into this order only if the attribute is explicitly declared as a feature set at indexing time
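why sorted input matters can be illustrated with a linear merge-style intersection; this Python sketch rests on two assumptions that the text does not confirm: that the internal sort order is byte-wise, and that /unify[] works like a merge of two sorted lists; all function names are hypothetical

```python
def canonical(members: list[str]) -> str:
    """Deduplicate and sort members (byte-wise order is an assumption)."""
    unique = sorted(set(members), key=lambda m: m.encode("utf-8"))
    return "|" + "".join(m + "|" for m in unique)


def intersect_sorted(a: list[str], b: list[str]) -> list[str]:
    """Intersect two sorted member lists in a single linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        ka, kb = a[i].encode("utf-8"), b[j].encode("utf-8")
        if ka == kb:
            out.append(a[i])
            i += 1
            j += 1
        elif ka < kb:
            i += 1
        else:
            j += 1
    return out


print(canonical(["Zeuge", "Zeug", "Zeugen", "Zeug"]))
print(intersect_sorted(["Dat", "Gen"], ["Dat", "Gen"]))   # sorted input
print(intersect_sorted(["Gen", "Dat"], ["Dat", "Gen"]))   # unsorted input
```

with sorted input the merge finds both common members; with the unsorted list it silently misses Dat, which is the kind of error the canonical format prevents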

thus, feature set attributes cannot encode ordered lists of values; if you need to distinguish between a first, second, ... alternative, you might add this information explicitly as a feature component, e.g.

|1:Zeuge|2:Zeug|3:Zeugen|
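with such rank prefixes the intended order can be recovered after export, whatever order the members are stored in; the Python sketch below assumes the "rank:value" convention of the example above, and the function name ranked_alternatives is hypothetical

```python
def ranked_alternatives(value: str) -> list[str]:
    """Return the set members ordered by their numeric rank prefix."""
    inner = value.strip("|")
    if not inner:
        return []
    pairs = []
    for member in inner.split("|"):
        # split "2:Zeug" into the rank "2" and the feature "Zeug"
        rank, _, feature = member.partition(":")
        pairs.append((int(rank), feature))
    return [feature for _, feature in sorted(pairs)]


print(ranked_alternatives("|1:Zeuge|2:Zeug|3:Zeugen|"))
print(ranked_alternatives("|3:Zeugen|1:Zeuge|2:Zeug|"))
```

both calls yield the same ranking; within CQP, a test such as contains "1:Zeuge" would then presumably select tokens whose first alternative is Zeuge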

for comparison, plain alemma values in feature set notation (the members carry no order information):

|Zeug|Zeuge|Zeugen|	(three elements)
|Baum|	(unique lemma)
|	(not in lexicon)