A. Appendix: Registry file format

The following is a sample registry file created by cwb-encode. The cwb-regedit tool also writes registry files in this format.

##
## registry entry for corpus BNCSAMPLER
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   bncsampler
# path to binary data files
HOME /home/Corpora/data/bncsampler
# optional info file (displayed by "info;" command in CQP)
INFO /home/Corpora//bncsampler/.info

# corpus properties provide additional information about the corpus:
##:: charset  = "utf8" # change if your corpus uses different charset
##:: language = "??"     # insert ISO code for language (de, en, fr, ...)


##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE pos
ATTRIBUTE hw
ATTRIBUTE semtag
ATTRIBUTE class
ATTRIBUTE lemma


##
## s-attributes (structural markup)
##

# <text id=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id              # [annotations]

# <s> ... </s>
STRUCTURE s


# Yours sincerely, the Encode tool.

CWB traditionally had a more flexible registry file format that could contain a variety of other declarations; it is still accepted for backward compatibility. The standard format for new corpora, however, is the one shown above. We recommend that you stick to it, since the CWB/Perl scripts enforce it.

Finally, note that directory and file paths in HOME and INFO entries must be double-quoted if they contain blanks or other non-standard characters. ASCII letters, digits, -, _, / and . are fine without quotes, as long as the path does not begin with a dot (.). In a double-quoted path, " must be escaped as \" and the backslash \ as \\. If you use cwb-encode and cwb-regedit, they should always create valid entries, adding quotes when necessary.
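If you write registry files by hand or from your own scripts, the quoting rule can be sketched in Python as follows. The function name quote_registry_path is hypothetical, and the code is only one reading of the rule stated above, not the actual logic of cwb-encode or cwb-regedit.

```python
import re

# Characters that may appear in an unquoted path according to the rule above:
# ASCII letters, digits, -, _, / and .
_SAFE_PATH = re.compile(r"[A-Za-z0-9_/.\-]+")

def quote_registry_path(path):
    """Return path in the form it could take after HOME or INFO (sketch)."""
    # Only safe characters and no leading dot: the path can stay unquoted.
    if _SAFE_PATH.fullmatch(path) and not path.startswith("."):
        return path
    # Otherwise escape backslashes first, then double quotes, and wrap in quotes.
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

print("HOME", quote_registry_path("/home/Corpora/data/bncsampler"))
print("HOME", quote_registry_path("/home/My Corpora/data/bncsampler"))
```

Backslashes are escaped before double quotes so that the backslashes introduced by the second step are not doubled again.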