1.1 The IMS Open Corpus Workbench (CWB)
- Tool development
- 1993 – 1996: Project on Text Corpora and Exploration Tools
(financed by the Land Baden-Württemberg)
- 1998 – 2004: Continued in-house development
(partly financed by various research and industrial projects)
- Related projects and applications at the IMS
- 1994 – 1998: EAGLES project (EU programme LRE/LE)
(morphosyntactic annotation, part-of-speech tagset, annotation tools)
- 1994 – 1996: DECIDE project (EU programme MLAP-93)
(extraction of collocation candidates, macro processor mp)
- 1996 – 1999: Construction of a subcategorization lexicon for German
(PhD thesis Eckle-Kohler, financed by the Land Baden-Württemberg)
- Since 1996: Various commercial and research applications
(terminology extraction, dictionary updates)
- 1999 – 2000: DOT project (Databank Overheidsterminologie)
- 1999 – 2003: Implementation of YAC chunk parser for German (PhD Kermes)
- 2001 – 2003: Transferbereich 32 (financed by the DFG)
- Development as an open software project
- 2005: Code released under GNU GPL by IMS, making CWB henceforth an
open, public collaborative enterprise
- 2001 – 2010: Work on first stable open version 3.0, released 2010
- 2010 – 2022: Overlapping work on versions 3.1 (added Windows support),
3.2 (added Unicode support), 3.4 (misc. fixes and enhancements), and 3.5
(new stable version)
- v3.5 will be the final iteration of the present CWB; v4 will be a major rewrite
- Some external applications of the IMS Corpus Workbench
(see http://cwb.sourceforge.net/demos.php for a longer list)
- CWB uses a bespoke token-based format for corpus storage:
- binary encoding ⇒ fast access
- full index ⇒ fast look-up of word forms and annotations
- specialised data compression algorithms
- corpus size: up to 2.1 billion words
- text data and annotations cannot be modified after encoding
(but it is possible to add new annotations or overwrite existing ones)
- early versions assumed Latin-1 text encoding, later versions support
multiple 8-bit character sets as well as UTF-8 for Unicode
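The idea behind a token-based format with a full index can be illustrated with a minimal sketch. This is not CWB's actual on-disk layout or API: the function names `encode` and `lookup` are hypothetical, and the data lives in ordinary Python structures rather than compressed binary files. The constant at the top rests on the assumption that corpus positions are stored as signed 32-bit integers, which would be consistent with the limit of about 2.1 billion words stated above.

```python
from collections import defaultdict

# Assumption: positions are signed 32-bit integers, hence the ~2.1 billion limit.
MAX_CORPUS_SIZE = 2**31 - 1  # 2,147,483,647

def encode(tokens):
    """Encode a token sequence as integer IDs plus a lexicon and a full index."""
    lexicon = {}               # word form -> integer ID
    id_stream = []             # one ID per corpus position
    index = defaultdict(list)  # ID -> ascending list of corpus positions
    for position, word in enumerate(tokens):
        word_id = lexicon.setdefault(word, len(lexicon))
        id_stream.append(word_id)
        index[word_id].append(position)
    return lexicon, id_stream, dict(index)

def lookup(word, lexicon, index):
    """Return all corpus positions of a word form without scanning the text."""
    word_id = lexicon.get(word)
    return index.get(word_id, []) if word_id is not None else []

tokens = "the cat sat on the mat".split()
lexicon, id_stream, index = encode(tokens)
print(id_stream)                      # [0, 1, 2, 3, 0, 4]
print(lookup("the", lexicon, index))  # [0, 4]
```

Because every word form is replaced by a small integer and every ID has its own position list, a look-up costs one dictionary access rather than a pass over the whole corpus.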
- Typical compression ratios for a 100 million word corpus:
- uncompressed text: 1 GByte (without index & annotations)
- uncompressed CWB attributes: 790 MBytes (ratio: 1.3)
- word forms & lexical attributes: 360 MBytes (ratio: 2.8)
- categorical attributes (e.g. POS tags): 120 MBytes (ratio: 8.5)
- binary attributes (yes/no): 50 MBytes (ratio: 20.5)
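The ratios in this list can be reproduced from the sizes, as the short calculation below shows. It assumes that 1 GByte is counted as 1024 MBytes, which is the reading under which all four rounded ratios come out as listed; the variable names are illustrative only.

```python
# Sizes in MBytes for a 100 million word corpus, copied from the list above.
UNCOMPRESSED_TEXT_MB = 1024  # assumption: 1 GByte = 1024 MBytes

sizes_mb = {
    "uncompressed CWB attributes": 790,
    "word forms & lexical attributes": 360,
    "categorical attributes (e.g. POS tags)": 120,
    "binary attributes (yes/no)": 50,
}

# Compression ratio = size of the uncompressed text / size of the encoded attribute.
ratios = {name: round(UNCOMPRESSED_TEXT_MB / size, 1)
          for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```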
- Supported operating systems:
- Linux
- Mac OS
- Microsoft Windows (64-bit)
- SUN Solaris
- Source code should compile on most recent Unix platforms (*BSD, Cygwin, etc.)
- Corpus data format is platform-independent and compatible with all
releases since 2001
- tools for encoding, indexing, compression, decoding, and frequency
distributions
- global “registry” holds information about corpora (name, attributes,
data path)
- corpus query processor (CQP):
- fast corpus search (regular expression syntax)
- use in interactive or batch mode
- results displayed in terminal window
- CWB/Perl interface for post-processing, scripting and web interfaces
- CQPweb: a browser-based graphical interface to CWB/CQP, with extended analysis tools
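To give a rough idea of what corpus search with regular expression syntax means at the token level, the following toy sketch matches a sequence of regular expressions against consecutive tokens. It is neither CQP's query syntax nor its matching algorithm (CQP works on the indexed corpus rather than scanning a Python list); the function name `find_matches` and the sample sentence are made up for illustration.

```python
import re

def find_matches(tokens, patterns):
    """Return start positions where consecutive tokens fully match the regexes."""
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for start in range(len(tokens) - len(compiled) + 1):
        window = tokens[start:start + len(compiled)]
        # Each pattern must match its token as a whole, not just a substring.
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(start)
    return hits

tokens = "he walks and she walked while they were walking".split()
matches = find_matches(tokens, [r"s?he", r"walk(s|ed|ing)"])
print(matches)  # [0, 3]
```

Each pattern constrains exactly one token, so the query above finds "he walks" and "she walked" but not "they were walking".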