word | word forms (“plain text”) |
pos | part-of-speech tags (Penn Treebank tagset) |
lemma | base forms (lemmata) |
novel | individual novels |
novel_title | title of the novel |
book | when text is subdivided into books |
book_num | number of the book |
chapter | chapters |
chapter_num | number of the chapter |
chapter_title | optional title of the chapter |
title | encloses title strings of novels, books, and chapters |
p | paragraphs |
p_len | length of the paragraph (in words) |
s | sentences |
s_len | length of the sentence (in words) |
np | noun phrases |
np_h | head lemma of the noun phrase |
np_len | length of the noun phrase (in words) |
pp | prepositional phrases |
pp_h | functional head of the PP (preposition) |
pp_len | length of the PP (in words) |
word | word forms (“plain text”) |
pos | part-of-speech tag (STTS tagset) |
lemma | base forms (lemmatised forms) |
alemma | ambiguous lemmatisation (feature set, see examples in Section 6.6) |
agr | noun agreement features (feature set, see examples in Section 6.6) |
Each agreement feature has the form ccc:g:nn:ddd with
ccc | = | case | (Nom, Gen, Dat, Akk) |
g | = | gender | (M, F, N) |
nn | = | number | (Sg, Pl) |
ddd | = | determination | (Def, Ind, Nil) |
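
As an illustration of the ccc:g:nn:ddd layout, the following Python sketch splits a single agreement feature into its four slots and checks each value against the lists in the table. The names `Agreement` and `parse_agreement` are hypothetical helpers introduced here; they are not part of any corpus tool.

```python
from typing import NamedTuple

# Allowed values per slot, copied from the table above.
ALLOWED = (
    ("case", {"Nom", "Gen", "Dat", "Akk"}),
    ("gender", {"M", "F", "N"}),
    ("number", {"Sg", "Pl"}),
    ("determination", {"Def", "Ind", "Nil"}),
)


class Agreement(NamedTuple):
    case: str
    gender: str
    number: str
    determination: str


def parse_agreement(feature: str) -> Agreement:
    """Split one ccc:g:nn:ddd string into its four named slots."""
    parts = feature.split(":")
    if len(parts) != len(ALLOWED):
        raise ValueError(f"expected 4 colon-separated fields, got {feature!r}")
    for value, (slot, allowed) in zip(parts, ALLOWED):
        if value not in allowed:
            raise ValueError(f"invalid {slot} value {value!r} in {feature!r}")
    return Agreement(*parts)


print(parse_agreement("Dat:F:Sg:Def"))
# → Agreement(case='Dat', gender='F', number='Sg', determination='Def')
```

Validating against the value lists is a design choice of this sketch: it makes malformed strings fail early instead of silently producing a wrong analysis.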
<s> | sentences |
<pp> | prepositional phrases |
<np> | noun phrases |
<ap> | adjectival phrases |
<advp> | adverbial phrases |
<vc> | verbal complexes |
<cl> | subclauses |
<s len="..">
<pp f=".." h=".." agr=".." len="..">
<np f=".." h=".." agr=".." len="..">
<ap f=".." h=".." agr=".." len="..">
<advp f=".." len="..">
<vc f=".." len="..">
<cl f=".." h=".." vlem=".." len="..">
len = length of region (in tokens)
f = properties (feature set, see the feature tables below)
h = lexical head of phrase (<pp_h>: “prep:noun”)
agr = nominal agreement features (feature set, partially disambiguated)
vlem = lemma of main verb
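
To make the attribute layout concrete, here is a minimal Python sketch that reads the attributes from one region start tag and splits a PP head of the form “prep:noun”. The functions `parse_start_tag` and `split_pp_head` are hypothetical, and the example tag, including the bar-delimited notation used for the `f` and `agr` values, is invented for illustration rather than taken from the corpus.

```python
import re
from typing import Dict, Tuple

# Matches name="value" pairs inside a start tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_start_tag(tag: str) -> Tuple[str, Dict[str, str]]:
    """Return the region name and its attributes as a dict of strings."""
    match = re.match(r"<(\w+)", tag)
    if match is None:
        raise ValueError(f"not a start tag: {tag!r}")
    return match.group(1), dict(ATTR_RE.findall(tag))


def split_pp_head(head: str) -> Tuple[str, str]:
    """Split a PP head of the form 'prep:noun' into its two parts."""
    prep, sep, noun = head.partition(":")
    if not sep:
        raise ValueError(f"expected 'prep:noun', got {head!r}")
    return prep, noun


# Invented example values, for illustration only.
name, attrs = parse_start_tag(
    '<pp f="|norm|" h="in:Haus" agr="|Dat:N:Sg:Def|" len="3">'
)
prep, noun = split_pp_head(attrs["h"])
print(name, int(attrs["len"]), prep, noun)
```

All attribute values are returned as strings; converting `len` to an integer is left to the caller, as in the last line.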
<np_f> | norm (“normal” NP), ne (named entity), rel (relative pronoun), wh (wh-pronoun), pron (pronoun), refl (reflexive pronoun), es (es), sich (sich), nodet (no determiner), quot (in quotes), brac (in parentheses), numb (list item), trunc (contains truncated nouns), card (cardinal number), date (date string), year (specifies year), temp (temporal), meas (measure noun), street (address), tel (telephone number), news (news agency) |
<pp_f> | same as <np_f> (features are projected from NP), plus nogen (no genitive modifier) |
<ap_f> | norm (“normal” AP), pred (predicative AP), invar (invariant adjective), vder (deverbal adjective), quot (in quotes), pp (contains PP complement), hypo (uncertain, AP was conjectured by chunker) |
<advp_f> | norm, temp (temporal adverbial), loc (locative adverbial), dirfrom (directional source), dirto (directional path) |
<vc_f> | norm, inf (infinitive), zu (zu-infinitive) |
<cl_f> | rel (relative clause), subord (subordinate clause), fin (finite), inf (infinitive), comp (comparative clause) |
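
Since each region can carry several of these property values at once, a small helper for testing set membership may be useful when post-processing exported annotations. The Python sketch below assumes that a feature set is serialised as a bar-delimited string such as `|rel|fin|`, with a single `|` for the empty set; that encoding is an assumption of this example, and `parse_feature_set` and `has_all` are hypothetical names.

```python
from typing import FrozenSet


def parse_feature_set(value: str) -> FrozenSet[str]:
    """Turn an assumed '|a|b|' string into a frozenset of its members."""
    # Splitting on '|' yields empty strings at both ends; drop them.
    return frozenset(item for item in value.split("|") if item)


def has_all(value: str, *required: str) -> bool:
    """True if every required property occurs in the feature set."""
    return set(required) <= parse_feature_set(value)


print(has_all("|rel|fin|", "rel"))         # → True
print(has_all("|rel|fin|", "rel", "inf"))  # → False
```

For example, `has_all(f, "rel", "fin")` would select finite relative clauses from a list of `<cl_f>` values under the stated encoding assumption.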