Some relevant links
Some of the links here may be out of date as we are unable to check them with any regularity. If you find a broken link, we suggest using the Wayback Machine to get an idea of what was previously at that location.
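As a rough illustration of the Wayback Machine suggestion above, the sketch below builds the archive address for a given link. It assumes the commonly used address pattern `https://web.archive.org/web/<timestamp>/<original URL>`, where `*` in place of the timestamp opens the overview of all snapshots. The helper name `wayback_url` and the example address are hypothetical and not part of any CWB tool.

```python
def wayback_url(url: str, timestamp: str = "*") -> str:
    """Return a Wayback Machine address for `url` (hypothetical helper).

    `timestamp` is either "*" (overview of all snapshots) or a string of
    up to 14 digits in YYYYMMDDhhmmss order; a shorter prefix such as
    "1999" is assumed to lead to a snapshot close to that date.
    """
    if timestamp != "*" and not (timestamp.isdigit() and len(timestamp) <= 14):
        raise ValueError("timestamp must be '*' or up to 14 digits")
    # The original URL is appended unchanged after the timestamp segment.
    return f"https://web.archive.org/web/{timestamp}/{url}"


if __name__ == "__main__":
    # Placeholder address, used only for illustration.
    print(wayback_url("http://example.org/old-page/"))
    print(wayback_url("http://example.org/old-page/", "1999"))
```

Pasting the printed address into a browser should show which snapshots, if any, exist for the broken link.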
Older CWB project sites
- Old homepage of the CWB at the IMS (University of Stuttgart); now offline, link is to the Wayback Machine's earliest snapshot from 1999; later snapshots are also available.
Similar projects
- NoSketch Engine is another open-source corpus indexing and query engine that uses (almost) the same input format and query language as the CWB. It consists of a corpus query server called Manatee and a Tcl/Tk-based GUI component called Bonito. NoSketch Engine is an open-source subset of the popular commercial Sketch Engine system.
- Poliqarp is a more recent concordancing engine, with functionality similar to the CWB and Manatee.
- The CSAR project (Corpus Server Architecture) aims to provide a common Web interface for different corpus query engines.
Users of the CWB
The following is an incomplete list of some projects using CWB as (part of) their corpus software infrastructure. Many other CWB-powered platforms can be found in the list of Web interfaces.
- BNCweb is a specialised Web GUI for the British National Corpus. BNCweb also supports the simplified CEQL query syntax. A public BNCweb server, hosted at Lancaster University, can be accessed here; see info here and complete the form here to request an account.
- TXM is a free and open-source, cross-platform, Unicode- and XML-based text/corpus analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE-compliant web portal (GWT-based) with built-in access control. It offers a comprehensive range of analysis tools (concordances, collocate search, frequency lists, etc.) based on Corpus Workbench's powerful CQP query engine and a range of statistical functions (factorial analysis, classification, co-occurrence analysis, etc.) based on R packages.
- SpoCo is a Web interface for spoken corpora with aligned audio files intended for dialect and language documentation projects. A detailed description can be found in this research paper.
- ParaVoz v2.0 is a specialized Web interface for parallel corpora (also see v1.0).
- spheroscope is a Web-based GUI for the development of complex queries with the help of CQP macros and word lists, designed for applications in argumentation mining.
Online demos
Official web-accessible CQP demos are hosted by the Computational Corpus Linguistics Group at FAU Erlangen-Nürnberg, Germany. These demos allow you to run CQP queries on selected corpora with various display and post-processing options. They can be used to walk through large parts of the CQP Query Language Manual without installing a local copy of the CWB and tutorial corpora. Click the links below to access the available corpora:
- DICKENS (English, 3.4M tokens): A collection of novels by Charles Dickens used as the main example corpus in the CQP Query Language Tutorial.
- BUNDESTAG (German, 5.7M tokens): Debates of the German parliament (1994–1998) with rich morphosyntactic annotation and shallow parsing. Suitable as a substitute for the smaller GLAW-NEW corpus of law texts in the CQP Query Language Tutorial.
- EUROPARL (6 languages, ca. 40M tokens each): Web GUI for the annotated Europarl Corpus, Version 3, containing debates of the European Parliament from the years 1996–2006 (currently, only six languages are included in the GUI). This interface also supports the simplified CEQL syntax, aligned context display and word lists with automatic generation of translation candidates. The Europarl corpus will be used by future editions of the CQP Query Language Tutorial to introduce query and display options for aligned corpora.
Many institutions now host public CQPweb installations. The oldest is maintained by the developer at Lancaster University. An external list of CQPweb servers, and other web interfaces using CWB/CQP, has been compiled by Xu Jiajin of BFSU.
Other online systems include:
- Korpus 2000 (Danish)
- Swedish corpora at Språkbanken (PAROLE & SUC)
- VISL CorpusEye – a friendly interface to annotated corpora in multiple languages
- FALKO – a German learner corpus (HU Berlin)
- Linguateca AC/DC provides access to many Portuguese corpora, including a treebank
- Web interfaces for the OPUS parallel corpora
- CucWeb – a Web corpus of Catalan texts (UPF Barcelona) (now uses CQPweb)
- Serge Sharoff's corpus collection: English, Russian, Chinese, Web corpus in 12 languages (U Leeds)
- Bwananet is a Web GUI for the Corpus Tècnic in Spanish, Catalan and English (IULA, UPF Barcelona)
- BancTrad – translated texts in multiple languages (UPF Barcelona) (now uses CQPweb)
- TScorpus – a Web GUI for Taner Sezer's 491-million-word corpus of Turkish
- Glossa search tool (tekstlab, U Oslo) – various mono- and multilingual corpora (contact tekstlab for a free guest account)
- ParaSol (Ruprecht v. Waldenfels, Roland Meyer) – a multilingual parallel corpus focussing on Slavic languages (free demo)
- TEITOK (Maarten Janssen) – a platform for indexing & searching TEI-encoded corpora
- SSLMIT Dev Online (U Bologna, Forlì) – 380M words of Italian newspaper text from La Repubblica, plus other corpora (free registration)
If you are offering a public Web interface based on the CWB and would like it to be listed here, please drop us a line.
Other links
Python APIs
Jørg Asmussen and Yannick Versley have developed a Python API for CQP and the low-level CL library, based on the corresponding CWB/Perl modules. The cwb-python module can be downloaded from PyPI as an installable package, but is no longer maintained.
A newer module, cwb-ccc, works around some issues in cwb-python and provides higher-level convenience functions. Developed and actively maintained by Philipp Heinrich, it can be installed directly from PyPI.
R APIs
The R package RcppCWB provides a full-fledged and reliable interface to CQP queries and low-level corpus access via the CL library from within R. The package can be installed directly from CRAN using the standard R package manager or the command install.packages("RcppCWB"). Binary releases for MacOS and Windows work without any other prerequisites. The package has been developed, and is actively maintained, by Andreas Blätte.
An earlier attempt to provide an R API for CWB was the package rcqp, developed by Bernard Desgraupes and Sylvain Loiseau. It is no longer available on CRAN due to compatibility issues, but the source code can still be obtained from the CRAN archive. It can be compiled from source on Linux and Mac OS X, provided that the external dependencies of CWB 3.5β have been installed. In particular, the Glib2 and PCRE libraries are required (see this document for details).