4 Indexing and compression without CWB/Perl

If you do not have the CWB/Perl interface installed, the best option is to install the CWB/Perl modules and the scripts they include, and then go back to Section 2. If it is impossible to install CWB/Perl (for example, on Windows), or if you really want to learn the nitty-gritty of corpus encoding, continue here.

Once the index files have been created, the corpus can already be used with CQP and other CWB tools. However, it is recommended that you compress the binary data files to save disk space and improve performance. For very small corpora (under 10 million tokens) compression makes little difference; for larger corpora, it probably will. Compression is only supported for p-attributes at present.
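The compression step can be sketched as a small shell function. This is a minimal sketch, not a definitive recipe: the corpus ID EXAMPLE and the helper names run and compress_corpus are placeholders introduced here, and the -A option (process all p-attributes) should be checked against the usage messages of your CWB version. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: compress all p-attributes of one corpus.
# "run" and "compress_corpus" are hypothetical helper names, not part of CWB.

run() {
    # In dry-run mode, print the command; otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

compress_corpus() {
    corpus=$1
    # Compress the token streams (attrib.corpus) of all p-attributes.
    run cwb-huffcode -A "$corpus" || return 1
    # Compress the index files (attrib.corpus.rev, attrib.corpus.rdx).
    run cwb-compress-rdx -A "$corpus" || return 1
}
```

For example, with DRY_RUN=1 set, compress_corpus EXAMPLE prints the two commands that would be run for a corpus registered as EXAMPLE.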

When compression has succeeded, both tools (cwb-huffcode and cwb-compress-rdx) print the full pathnames of the uncompressed data files that are now redundant and can be deleted: attrib.corpus after running cwb-huffcode; attrib.corpus.rev and attrib.corpus.rdx after running cwb-compress-rdx.
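To review which redundant files are still present before deleting anything, a helper along the following lines may be useful. It is only a sketch: the function name redundant_files and its arguments (data directory, attribute name) are assumptions made for illustration, and it merely lists files matching the three names given above. It does not check that compression actually succeeded, so compare its output with the pathnames printed by the tools.

```shell
#!/bin/sh
# Sketch: list the uncompressed data files of one p-attribute
# that still exist in a corpus data directory.
# "redundant_files" is a hypothetical helper, not a CWB tool.

redundant_files() {
    datadir=$1   # corpus data directory
    attr=$2      # p-attribute name, e.g. word
    for suffix in corpus corpus.rev corpus.rdx; do
        f="$datadir/$attr.$suffix"
        # Print only the files that are actually present.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}
```

The function only prints pathnames; removing the files is left as a deliberate manual step.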

If you run cwb-makeall again, it will now show that the p-attributes are compressed. The compressed data files are validated by default, so it is safe to remove the redundant files.

Validation can be turned off in both cwb-huffcode and cwb-compress-rdx, using the -T option (T for trust). However, letting validation run causes far fewer performance problems than can arise from validation with cwb-makeall.