Nowadays, machine-readable text and linguistic annotations are often provided in XML format. CWB's XML support is activated by the following encoding options: -x for XML compatibility mode (recognises default entities and skips comments as well as an XML declaration), -s to skip blank lines in the input, and -B to strip whitespace from tokens. All three options -xsB should (almost) always be used (see footnote 10). The vertical text format with TAB-separated p-attributes is still required by cwb-encode, but this format can easily be generated from an arbitrary XML file with a short script in any suitable language. Figure 3 shows a typical example of an XML input file for the CWB, including an XML declaration and a comment line, both of which the options -xsB will cause to be ignored. Note that despite the use of tabs for columns, this is still a well-formed XML file.
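The conversion script mentioned above might look like the following minimal Python sketch. It is an illustration only: the function name xml_to_vertical is hypothetical, tokens are assumed to be separated by whitespace, and only the word column is produced. Further p-attribute columns such as pos and lemma would have to be appended (TAB-separated) by a tagger or a similar tool.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def xml_to_vertical(xml_text):
    """Return vertical-format lines: one token or one XML tag per line."""
    lines = []

    def emit_tokens(text):
        # Naive whitespace tokenisation (an assumption of this sketch);
        # escape &, < and > so that tokens cannot be mistaken for tags.
        for token in (text or "").split():
            lines.append(escape(token))

    def walk(elem):
        # Start tag with its attribute-value pairs on a line of its own.
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in elem.attrib.items())
        lines.append(f"<{elem.tag}{attrs}>")
        emit_tokens(elem.text)
        for child in elem:
            walk(child)
            emit_tokens(child.tail)  # text following the child's end tag
        lines.append(f"</{elem.tag}>")

    walk(ET.fromstring(xml_text))
    return lines


if __name__ == "__main__":
    sample = '<story num="4"><s>A fine day .</s></story>'
    print("\n".join(xml_to_vertical(sample)))
```

Comments in the input are discarded by the parser used here, so the output contains only tags and tokens.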
XML elements (i.e. matching pairs of start and end tags) can be
encoded as s-attributes, which have to be declared with -S flags (for
the file vss.vrt, the flags -S story -S p -S s
would be used). If XML regions of the same type are
nested, encoding will only work correctly if you add :0 to
the s-attribute declaration, which enables a rudimentary XML parser built into
cwb-encode. Attribute-value pairs in XML start tags, such as
<story num="4" title="A Thrilling Experience">, can be stored as a single
unparsed text string (num="4" title="A Thrilling Experience") by using the
flag -V instead of -S. This form of encoding is not convenient for CQP
queries, though. It is usually preferable to declare the XML tag attributes
explicitly; doing so automatically splits the XML element into multiple
s-attributes.
$ cwb-encode -d /corpora/data/vss -f vss.vrt -R /usr/local/share/cwb/registry/vss -xsBC -c ascii -P pos -P lemma -S s:0 -S p:0 -S story:0+num+title -0 collection
$ cwb-make -V VSS
If you do not have the cwb-make script available, follow the steps in Section 4.
These commands will encode the corpus VSS and create a registry file,
including the s-attributes s, p, story,
story_num, and story_title. The <story> start tags
are parsed and the attribute values are stored as annotations of the
attributes story_num (value: 4) and story_title
(value: A Thrilling Experience). Regions of the story attribute
itself will not be annotated. Use -V instead of -S to store
all attribute-value pairs as a single string, which can be useful for
displaying and re-exporting the XML tags.
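The naming scheme described above can be illustrated with a small Python sketch. It does not reproduce the cwb-encode implementation; the helper names expand_spec and tag_annotations are hypothetical, and the sketch handles only declarations of the form used in this section (name, optional :N suffix, then +attribute entries).

```python
import re


def expand_spec(spec):
    """'story:0+num+title' -> ['story', 'story_num', 'story_title']."""
    head, *xml_attrs = spec.split("+")
    name = head.partition(":")[0]  # drop the ':0' nesting suffix
    return [name] + [f"{name}_{attr}" for attr in xml_attrs]


def tag_annotations(spec, start_tag):
    """Map each derived s-attribute to its value in a single start tag."""
    head, *xml_attrs = spec.split("+")
    name = head.partition(":")[0]
    # Collect attribute="value" pairs from the start tag (double quotes only).
    found = dict(re.findall(r'([\w.-]+)="([^"]*)"', start_tag))
    return {f"{name}_{a}": found[a] for a in xml_attrs if a in found}


if __name__ == "__main__":
    tag = '<story num="4" title="A Thrilling Experience">'
    print(expand_spec("story:0+num+title"))
    print(tag_annotations("story:0+num+title", tag))
```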
XML elements with different names (such as <s> and <p>) are
encoded independently, so they can nest and overlap in arbitrary ways. The
cwb-encode program does not perform any validation or well-formedness
tests on the XML elements.
When elements are nested recursively (e.g. a <table> within a <table>),
the embedded elements will be ignored because of the :0 specified above.
After encoding, cwb-encode prints a summary listing the number of dropped
XML elements. If you want to preserve nested elements instead, you can
specify a maximum level of embedding in place of :0 in the examples above.
For instance, -S table:2 allows two levels of embedding for <table>
elements. Nested elements are automatically renamed to <table1> and
<table2>, respectively, and stored in separate s-attributes.
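The renaming of nested elements can be pictured with the following Python sketch. It only models the behaviour described in this section and is not taken from cwb-encode; the function rename_nested and its event list of "open"/"close" strings are assumptions made for illustration.

```python
def rename_nested(events, name, max_depth):
    """Assign s-attribute names to nested regions of a single element type.

    events    -- sequence of "open" / "close" strings for <name> tags
    max_depth -- the N in '-S name:N' (0 keeps only top-level regions)
    Returns (kept, dropped): the renamed tag events and the number of
    elements nested too deeply to be kept.
    """
    def label(depth):
        # Top-level regions keep the plain name; embedded ones get a suffix.
        return name if depth == 0 else f"{name}{depth}"

    depth = 0
    kept = []
    dropped = 0
    for event in events:
        if event == "open":
            if depth <= max_depth:
                kept.append(("open", label(depth)))
            else:
                dropped += 1
            depth += 1
        elif event == "close" and depth > 0:
            depth -= 1
            if depth <= max_depth:
                kept.append(("close", label(depth)))
    return kept, dropped


if __name__ == "__main__":
    # Four <table> elements nested inside each other, declared as table:2.
    kept, dropped = rename_nested(["open"] * 4 + ["close"] * 4, "table", 2)
    print([tag for _, tag in kept], dropped)
```

With max_depth set to 0 the same function drops every embedded element, which corresponds to the :0 declarations used earlier.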
Sometimes, the input data may contain XML tags that should not be encoded in
the corpus. For instance, the stories in vss.vrt have to be wrapped in
a single root element <collection> in order to obtain a well-formed XML
file. Instead of removing such tags during data preparation, you can have
cwb-encode filter them out directly. For this purpose,
they have to be declared with the flag -0 (digit zero, for “null
attribute”) instead of -S or -V. All start and end tags of
these elements will be ignored completely. There is no need to add
:0 or XML attribute declarations. Note that all XML tags that have
not been declared with a -S, -V or -0 flag will be
encoded as literal tokens (that is, words, without annotations),
accompanied by a warning message.
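To find out in advance which tags still need a -S, -V or -0 declaration, the input file can be scanned for element names. The following Python sketch is one possible way to do this; the helper names tag_names and undeclared are hypothetical, and the sketch assumes that every tag sits on a line of its own, as the vertical format requires.

```python
import re

# A line counts as a tag if it starts with '<' or '</' followed by a name.
# The XML declaration ('<?xml') and comments ('<!--') do not match.
TAG_RE = re.compile(r"^</?([A-Za-z_][\w.-]*)")


def tag_names(lines):
    """Return the set of XML element names found in vertical-format lines."""
    names = set()
    for line in lines:
        match = TAG_RE.match(line.strip())
        if match:
            names.add(match.group(1))
    return names


def undeclared(lines, declared):
    """Element names in the input that are missing from the declarations."""
    return tag_names(lines) - set(declared)


if __name__ == "__main__":
    sample = [
        '<?xml version="1.0"?>',
        "<!-- a comment -->",
        "<collection>",
        '<story num="4">',
        "<s>",
        "A\tDET\ta",
        "</s>",
        "</story>",
        "</collection>",
    ]
    print(sorted(undeclared(sample, ["story", "p", "s"])))
```

In practice the lines would be read from the input file, for example with open("vss.vrt", encoding="latin-1").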
Starting with CWB 3.4.21, unknown XML tags can automatically be declared as null attributes with the -9 (“auto-null”) option. This is recommended to capture the correct token stream for an input file with very many and/or undocumented XML elements:
$ cwb-encode -d /corpora/data/vss -f vss.vrt -R /usr/local/share/cwb/registry/vss -xsBC -c ascii -9 -P pos -P lemma
You may have noticed in Figure 3 that the XML file is declared to be in ISO 8859-1 (or Latin-1) encoding rather than the standard UTF-8 format. CWB ignores this, along with the rest of the XML declaration. The charset still needs to be specified on the command line; here, it is -c ascii, since we know there are no non-ASCII characters in this particular file; see also Section 2.
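If it is not known whether a file really is pure ASCII, this can be checked before encoding. The Python sketch below is a simple illustration that works on the raw bytes of the file; the function name first_non_ascii_line is hypothetical.

```python
def first_non_ascii_line(data):
    """Return the 1-based number of the first line containing a
    non-ASCII byte, or None if the data is pure ASCII."""
    for number, line in enumerate(data.splitlines(), start=1):
        if not line.isascii():
            return number
    return None


if __name__ == "__main__":
    ascii_only = b"A\tDET\ta\nfine\tADJ\tfine\n"
    latin1 = "A\tDET\ta\ncaf\u00e9\tNN\tcaf\u00e9\n".encode("latin-1")
    print(first_non_ascii_line(ascii_only))
    print(first_non_ascii_line(latin1))
```

For a real corpus file, the bytes would be obtained with open("vss.vrt", "rb").read(). If a line number is reported, -c ascii is not appropriate and the actual character set of the file should be declared instead.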