The following is a sample registry file created by cwb-encode. The cwb-regedit tool also creates registry files in this format.
    ##
    ## registry entry for corpus BNCSAMPLER
    ##

    # long descriptive name for the corpus
    NAME ""
    # corpus ID (must be lowercase in registry!)
    ID   bncsampler
    # path to binary data files
    HOME /home/Corpora/data/bncsampler
    # optional info file (displayed by "info;" command in CQP)
    INFO /home/Corpora//bncsampler/.info

    # corpus properties provide additional information about the corpus:
    ##:: charset  = "utf8" # change if your corpus uses different charset
    ##:: language = "??"   # insert ISO code for language (de, en, fr, ...)

    ##
    ## p-attributes (token annotations)
    ##

    ATTRIBUTE word
    ATTRIBUTE pos
    ATTRIBUTE hw
    ATTRIBUTE semtag
    ATTRIBUTE class
    ATTRIBUTE lemma

    ##
    ## s-attributes (structural markup)
    ##

    # <text id=".."> ... </text>
    # (no recursive embedding allowed)
    STRUCTURE text
    STRUCTURE text_id              # [annotations]

    # <s> ... </s>
    STRUCTURE s

    # Yours sincerely, the Encode tool.
CWB traditionally had a more flexible registry file format, which could contain a variety of other declarations and is still accepted for backward compatibility. The standard format for new corpora, however, is the one shown above; we recommend that you stick to it, since it is in fact enforced by the CWB/Perl scripts.
Finally, it is worth noting that directory and file paths in HOME and INFO entries have to be double-quoted if they contain blanks or other non-standard characters (ASCII letters, digits, -, _, / and . are ok, as long as the path does not begin with .). In a double-quoted path, " must be escaped as \" and the backslash \ as \\. If you use cwb-encode and cwb-regedit, they should always create valid entries, with quotes added when necessary.
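If you ever need to write a HOME or INFO entry by hand or from your own script, the quoting rule can be sketched in a few lines of Python. The function name quote_registry_path is hypothetical and not part of CWB. The sketch assumes that a path beginning with a period must be quoted, and that \" and \\ are the only escapes needed.

```python
import re

# Characters that may appear in an unquoted path:
# ASCII letters, digits, "-", "_", "/" and ".".
_SAFE = re.compile(r'[A-Za-z0-9_/.-]+')


def quote_registry_path(path):
    """Return path in a form suitable for a HOME or INFO entry."""
    # Assumption: a path starting with "." is not safe to leave unquoted.
    if _SAFE.fullmatch(path) and not path.startswith("."):
        return path
    # Escape backslashes first, so that the backslashes added for
    # the double quotes are not doubled again.
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


print("HOME", quote_registry_path("/home/Corpora/data/bncsampler"))
print("HOME", quote_registry_path("/home/My Corpora/data/bncsampler"))
```

The first call leaves the path unchanged; the second wraps it in double quotes because it contains a blank.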