1.1 The IMS Open Corpus Workbench (CWB)

1.1.0.1 History and framework

Tool development
- 1993 – 1996: Project on Text Corpora and Exploration Tools
  (financed by the Land Baden-Württemberg)
- 1998 – 2004: Continued in-house development
  (partly financed by various research and industrial projects)
Related projects and applications at the IMS
- 1994 – 1998: EAGLES project (EU programme LRE/LE)
  (morphosyntactic annotation, part-of-speech tagset, annotation tools)
- 1994 – 1996: DECIDE¹ project (EU programme MLAP-93)
  (extraction of collocation candidates, macro processor mp)
- 1996 – 1999: Construction of a subcategorization lexicon for German
  (PhD thesis Eckle-Kohler, financed by the Land Baden-Württemberg)
- Since 1996: Various commercial and research applications
  (terminology extraction, dictionary updates)
- 1999 – 2000: DOT project (Databank Overheidsterminologie)
- 1999 – 2003: Implementation of YAC chunk parser for German (PhD Kermes)
- 2001 – 2003: Transferbereich 32 (financed by the DFG)
Development as an open software project
- 2005: Code released under GNU GPL by IMS, making CWB henceforth an open, public collaborative enterprise
- 2001 – 2010: Work on first stable open version 3.0, released 2010
- 2010 – 2022: Overlapping work on versions 3.1 (added Windows support), 3.2 (added Unicode support), 3.4 (misc. fixes and enhancements), and 3.5 (new stable version)
- v3.5 will be the final iteration of the present CWB; v4 will be a major rewrite
Some external applications of the IMS Corpus Workbench
(see http://cwb.sourceforge.net/demos.php for a longer list)
- AC/DC project at the Linguateca centre (SINTEF, Oslo, Norway)
  (on-line access to a 180 M word corpus of Portuguese newspaper text)
  http://www.linguateca.pt/ACDC/
- CorpusEye (user-friendly CQP) in the VISL project (SDU, Denmark)
  (on-line access to annotated corpora in various languages)
  http://corp.hum.sdu.dk/
- SSLMIT Dev Online services (SSLMIT, University of Bologna, Italy)
  (on-line access to 380 M words of Italian newspaper text and other corpora)
  http://sslmitdev-online.sslmit.unibo.it/corpora/corpora.php (site no longer online)
- CucWeb project (UPF, Barcelona, Spain)
  (Google-style access to 208 million words of text from Catalan Web pages)
  http://ramsesii.upf.es/cucweb/ (site no longer online)
- BNCweb (CQP edition)
  (Web interface to the British National Corpus, ported from SARA to CQP)
  http://corpora.lancs.ac.uk/BNCweb/

1.1.0.2 Technical aspects

CWB uses a bespoke token-based format for corpus storage:
- binary encoding $\Rightarrow$ fast access
- full index $\Rightarrow$ fast look-up of word forms and annotations
- specialised data compression algorithms
- corpus size: up to 2.1 billion words
- text data and annotations cannot be modified after encoding
  (but it is possible to add new annotations or overwrite existing ones)
- early versions assumed Latin-1 text encoding; later versions support multiple 8-bit character sets as well as UTF-8 for Unicode
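
The idea behind a token-based, fully indexed format can be illustrated with a toy sketch. This is not CWB's actual on-disk format: the names `encode`, `lookup` and `MAX_POSITIONS` are hypothetical, and the link between the 2.1 billion word limit and signed 32-bit corpus positions is an assumption made for illustration.

```python
from array import array
from collections import defaultdict

# Assumption: the 2.1 billion word limit corresponds to the largest
# signed 32-bit integer, 2**31 - 1 = 2,147,483,647.
MAX_POSITIONS = 2**31 - 1


def encode(tokens):
    """Toy encoder: map each token to an integer ID and index its positions."""
    lexicon = {}                # word form -> integer ID
    id_stream = array("i")      # one ID per corpus position (compact binary array)
    index = defaultdict(list)   # ID -> ascending list of corpus positions
    for position, token in enumerate(tokens):
        token_id = lexicon.setdefault(token, len(lexicon))
        id_stream.append(token_id)
        index[token_id].append(position)
    return lexicon, id_stream, index


def lookup(word, lexicon, index):
    """Return all corpus positions of a word form without scanning the text."""
    token_id = lexicon.get(word)
    return [] if token_id is None else index[token_id]


tokens = "the cat sat on the mat".split()
lexicon, id_stream, index = encode(tokens)
print(lookup("the", lexicon, index))  # [0, 4]
```

Because every word form is stored once in the lexicon and the text itself is only a stream of integers, look-up reduces to one dictionary access plus reading a position list.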
Typical compression ratios for a 100 million word corpus:
- uncompressed text: $\approx$ 1 GByte (without index & annotations)
- uncompressed CWB attributes: $\approx$ 790 MBytes (ratio: 1.3)
- word forms & lexical attributes: $\approx$ 360 MBytes (ratio: 2.8)
- categorical attributes (e.g. POS tags): $\approx$ 120 MBytes (ratio: 8.5)
- binary attributes (yes/no): $\approx$ 50 MBytes (ratio: 20.5)
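
The ratios above appear to be the uncompressed text size divided by the attribute size. A minimal check, assuming 1 GByte = 1024 MBytes (the source does not state the convention; the variable names are illustrative):

```python
# Assumption: 1 GByte is taken as 1024 MBytes.
GBYTE_IN_MBYTES = 1024

# Sizes in MBytes for a 100 million word corpus, as listed above.
sizes_mbytes = {
    "uncompressed CWB attributes": 790,
    "word forms & lexical attributes": 360,
    "categorical attributes": 120,
    "binary attributes": 50,
}

# Compression ratio = uncompressed text size / stored attribute size.
ratios = {
    name: round(GBYTE_IN_MBYTES / size, 1)
    for name, size in sizes_mbytes.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}")
```

Under this assumption the computed values agree with the listed ratios (1.3, 2.8, 8.5, 20.5).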
Supported operating systems:
- Linux
- Mac OS
- Microsoft Windows (64-bit)
- SUN Solaris
- Source code should compile on most recent Unix platforms (*BSD, Cygwin, etc.)
- Corpus data format is platform-independent and compatible with all releases since 2001

1.1.0.3 Components of the CWB

- tools for encoding, indexing, compression, decoding, and frequency distributions
- global “registry” holds information about corpora (name, attributes, data path)
- corpus query processor (CQP):
  - fast corpus search (regular expression syntax)
  - use in interactive or batch mode
  - results displayed in terminal window
- CWB/Perl interface for post-processing, scripting and web interfaces
- CQPweb: a browser-based graphical interface to CWB/CQP, with extended analysis tools
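
The role of the registry can be pictured as a simple look-up table from corpus name to its description. The following sketch only models the three pieces of information named above (name, attributes, data path). It does not reproduce the real registry file format, and the class `RegistryEntry`, the functions `register` and `find_corpus`, the corpus name and the path are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Hypothetical record describing one corpus."""
    name: str                                        # corpus name
    data_path: str                                   # directory holding the encoded data
    attributes: list = field(default_factory=list)   # annotation layers


registry = {}  # global table: lower-cased corpus name -> RegistryEntry


def register(entry):
    """Add a corpus description to the registry."""
    registry[entry.name.lower()] = entry


def find_corpus(name):
    """Look up a corpus by name, ignoring case; returns None if unknown."""
    return registry.get(name.lower())


# Illustrative values only; not a real corpus or path.
register(RegistryEntry("EXAMPLE", "/corpora/data/example", ["word", "pos", "lemma"]))

entry = find_corpus("example")
print(entry.data_path, entry.attributes)
```

Keeping this information in one place means that tools and query front ends only need a corpus name to locate its data and annotations.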