4 Indexing and compression without CWB/Perl

If you do not have the CWB/Perl interface installed, the best thing you can do is to install the CWB/Perl modules and the scripts they include, and then go back to Section 2. If it is impossible to install CWB/Perl (for example, on Windows), or if you really want to learn the nitty-gritty of corpus encoding, continue here.

In the manual procedure, indexing and compression are performed in separate steps by different tools. First, you have to run cwb-makeall in order to build the necessary index files.
```
$ cwb-makeall -V EXAMPLE
```
cwb-makeall accepts the same -V, -M and -r options as cwb-make. The comments above on enabling and disabling validation with cwb-make apply to cwb-makeall as well.

When the index files have been created, the corpus can already be used with CQP and other CWB tools. However, it is recommended that you compress the binary data files to save disk space and improve performance. For very small corpora (under 10 million tokens) the compression won't make a lot of difference; for corpora larger than that, it probably will. Compression is only supported for p-attributes at present.

For positional attributes, both the token stream data and the index can be compressed. There are separate tools for compressing the two types of data files.
The token stream can be compressed with the cwb-huffcode tool. Use the -P option to process a single attribute, or compress all p-attributes with -A.
```
$ cwb-huffcode -A EXAMPLE
```
Index files can be compressed with the cwb-compress-rdx tool, which accepts the same options.
```
$ cwb-compress-rdx -A EXAMPLE
```
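If you only want to compress some p-attributes, the -P option can be applied in a loop. The following is a minimal sketch, not part of the CWB distribution: the function name compress_attributes and the CWB_RUN variable are hypothetical, and the attribute names in the usage comment are only examples. Setting CWB_RUN=echo prints the commands instead of running them, which is a convenient way to check what would happen.
```shell
#!/bin/sh
# Sketch: compress selected p-attributes one at a time with -P.
# CWB_RUN is a hypothetical hook; set CWB_RUN=echo for a dry run
# that only prints the commands.
compress_attributes() {
    corpus=$1
    shift
    for attr in "$@"; do
        # token stream first, then the index files of the same attribute
        ${CWB_RUN:-} cwb-huffcode -P "$attr" "$corpus" || return 1
        ${CWB_RUN:-} cwb-compress-rdx -P "$attr" "$corpus" || return 1
    done
}

# Usage (attribute names are examples only):
#   compress_attributes EXAMPLE word pos
```
The function stops at the first failing command, so a later attribute is not processed if compression of an earlier one went wrong.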

If compression succeeds, both tools print the full pathnames of the uncompressed data files that are now redundant and can be deleted: attrib.corpus after running cwb-huffcode; attrib.corpus.rev and attrib.corpus.rdx after running cwb-compress-rdx.
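To double-check which of these files are still present for a given attribute before deleting anything, a small helper along the following lines may be useful. This is only a sketch: the function name list_redundant_files and its two arguments (the data directory and the attribute name) are hypothetical. It deletes nothing; it only lists the files that match the three names given above.
```shell
#!/bin/sh
# Sketch: print the uncompressed files of one p-attribute that become
# redundant after compression. Nothing is deleted here.
list_redundant_files() {
    data_dir=$1   # corpus data directory (hypothetical argument)
    attr=$2       # p-attribute name
    for suffix in corpus corpus.rev corpus.rdx; do
        file="$data_dir/$attr.$suffix"
        # report only files that actually exist
        if [ -f "$file" ]; then
            printf '%s\n' "$file"
        fi
    done
    return 0
}
```
Only delete files that cwb-huffcode or cwb-compress-rdx actually reported as redundant; the listing is meant as a cross-check, not as a replacement for the tools' own output.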

If you run cwb-makeall again, it will now show that the p-attributes are compressed. The compressed data files are validated by default, so it is safe to remove the redundant files.

Validation can be turned off in both cwb-huffcode and cwb-compress-rdx with the -T option (T for trust). However, leaving validation enabled here causes far fewer performance problems than validation with cwb-makeall can.

NB: If you re-encode a corpus, it is important to erase all files in the data directory first. The cwb-makeall program will not recognize that existing index files or compressed data files are out of date, and will therefore fail to rebuild them automatically. (This is one of the reasons why the CWB/Perl cwb-make tool should be preferred.)
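Clearing the data directory before re-encoding could be scripted roughly as follows. This is a cautious sketch under stated assumptions: the function name clean_data_dir is hypothetical, the data directory is passed as its only argument, and the directory is assumed to contain nothing but files belonging to this one corpus. Subdirectories and hidden files are left untouched.
```shell
#!/bin/sh
# Sketch: remove every non-hidden regular file directly inside a corpus
# data directory before re-encoding. Subdirectories are left alone.
clean_data_dir() {
    data_dir=$1
    # refuse empty, root or missing paths
    if [ -z "$data_dir" ] || [ "$data_dir" = "/" ] || [ ! -d "$data_dir" ]; then
        echo "clean_data_dir: not a usable directory: '$data_dir'" >&2
        return 1
    fi
    for f in "$data_dir"/*; do
        # skip anything that is not a regular file
        if [ -f "$f" ]; then
            rm -f -- "$f"
        fi
    done
    return 0
}
```
Call it with the data directory of the corpus before repeating the encoding step, then run cwb-makeall and the compression tools again as described above.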