Ziggurat and the future of CWB

Ziggurat is the name of our current project to develop a successor to the corpus-indexing technology that underpins the Corpus Workbench.

The project aims to produce:

A new data model definition (and a corresponding index file format)
An engine for indexing and retrieval
A new version of CWB built on the new data model and engine

Please join the CWBdev mailing list if you are interested in hearing about our progress on Ziggurat.

So far, we have developed a proposal (not wholly finalised) for the data model and file formats. You can read about this in the documents linked below. The next step will be a rough prototype for the Ziggurat API. Finally, we will move on to full implementation.

Corpus Workbench version 4: the future

Although we hope that Ziggurat will be broadly useful, its primary purpose is to power CWB v 4.

CWB 4 will be a complete re-design that uses Ziggurat to improve flexibility and scalability. Some design goals:

Lift the limit on corpus size in CWB v 3 (2.1 billion words)
Better support for querying hierarchical (XML) markup
Better support for querying dependency-annotated corpora
Better support for complex annotation values (e.g. sets)
Easier programmatic access to CQP queries
More robust corpus management (goodbye registry directory)
Rebalanced trade-off between disk space and processor time
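
The 2.1 billion figure in the first goal is consistent with token positions being stored as signed 32-bit integers. That is an assumption made for this illustration, not something stated on this page. The sketch below only shows the arithmetic; the helper name `last_position_fits` is hypothetical and is not part of CWB or Ziggurat.

```python
# Assumption (not stated above): token positions are signed 32-bit integers.
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
INT64_MAX = 2**63 - 1   # largest signed 64-bit value, for comparison


def last_position_fits(n_tokens: int, max_position: int) -> bool:
    """Return True if tokens numbered 0..n_tokens-1 all fit below the maximum."""
    return n_tokens - 1 <= max_position


print(INT32_MAX)                                     # 2147483647, i.e. about 2.1 billion
print(last_position_fits(3_000_000_000, INT32_MAX))  # False: too large for 32 bits
print(last_position_fits(3_000_000_000, INT64_MAX))  # True: fits in 64 bits
```

Under that assumption, a three-billion-word corpus cannot be addressed with 32-bit positions, whereas a wider integer type removes the ceiling for any realistic corpus.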

Work on CWB v 4 will begin after the 1.0 release of Ziggurat itself.

The Ziggurat data model

Ziggurat data model and file format, Draft 1.5 (OSF)

Comments are welcome! You might want to read Evert & Hardie (2015) first to get an overview.

New versions of this and other Ziggurat documents will be published soon. As always, code will appear on the SourceForge repository as we write it.

How to install

Ziggurat is not end-user software. Most people will never need to install it; only developers need to consider doing so.

Read more

To learn more about Ziggurat, see:

Evert, S. and Hardie, A. (2015). Ziggurat: A new data model and indexing format for large annotated text corpora. In Proceedings of the 3rd Workshop on the Challenges in the Management of Large Corpora (CMLC-3), pages 21–27, Lancaster, UK. (PDF)
Evert, S. and Hardie, A. (2021). Ziggurat v0.1: A next-generation system for modelling, storing, and retrieving corpus (and other) data. Presentation at Corpus Linguistics 2021, Limerick (online). (Slides, Video on YouTube)
Evert, S., Hardie, A. and Weber, T. (2023). The Ziggurat data model and file format (draft 1.5). Technical report. Available from https://osf.io/n75es/.
