Ziggurat and the future of CWB
Ziggurat is the name of our current project to develop a successor to the corpus-indexing technology that underpins the Corpus Workbench.
The project aims to produce:
- A new data model definition (and a corresponding index file format)
- An engine for indexing and retrieval
- A new version of CWB built on the data model and engine above
Please join the CWBdev mailing list if you are interested in hearing about our progress on Ziggurat.
So far, we have developed a proposal (not wholly finalised) for the data model and file formats. You can read about this in the documents linked below. The next step will be a rough prototype for the Ziggurat API. Finally, we will move on to full implementation.
Corpus Workbench version 4: the future
Although we hope that Ziggurat will be broadly useful, its primary purpose is to power CWB v 4.
CWB 4 will be a complete redesign that uses Ziggurat to improve flexibility and scalability. Some design goals:
- Lift the corpus size limit that applies in CWB v 3 (2.1 billion words)
- Better support for querying hierarchical (XML) markup
- Better support for querying dependency-annotated corpora
- Better support for complex annotation values (e.g. sets)
- Easier programmatic access to CQP queries
- More robust corpus management (goodbye registry directory)
- Rebalanced trade-off between disk space and processor time
Work on CWB v 4 will begin after the 1.0 release of Ziggurat itself.
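The 2.1 billion word figure listed among the design goals coincides with the largest value a signed 32-bit integer can hold. The sketch below works through that arithmetic, assuming (this page does not say so) that the limit comes from storing corpus positions as signed 32-bit integers. The function name max_signed_value is hypothetical, and the 64-bit line is shown for comparison only; it is not a statement about the Ziggurat file format.

```python
def max_signed_value(bits: int) -> int:
    """Return the largest value a signed two's-complement integer of this width can hold."""
    return 2 ** (bits - 1) - 1


# Assumption: one signed 32-bit integer per corpus position.
limit_32 = max_signed_value(32)
print(f"32-bit: {limit_32:,} (about {limit_32 / 1e9:.1f} billion)")

# A wider integer, purely for comparison.
limit_64 = max_signed_value(64)
print(f"64-bit: {limit_64:,}")
```

Under that assumption, no corpus with more than 2,147,483,647 token positions could be addressed, whatever the available disk space or memory.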
The Ziggurat data model
- Ziggurat data model and file format, version 1.0 (PDF)
Comments are welcome! You might want to read Evert & Hardie (2015) first to get an overview.
New versions of this and other Ziggurat documents will be published soon. As always, code will appear on the SourceForge repository as we write it.
How to install
Ziggurat is not end-user software. Most people will never need to install it; only developers need to consider doing so.
Read more
To learn more about Ziggurat, see:
- Evert, S. and Hardie, A. (2015). Ziggurat: A new data model and indexing format for large annotated text corpora. In Proceedings of the 3rd Workshop on the Challenges in the Management of Large Corpora (CMLC-3), pages 21–27, Lancaster, UK. (PDF)
- Evert, S. and Hardie, A. (2021). Ziggurat v0.1: A next-generation system for modelling, storing, and retrieving corpus (and other) data. Presentation at Corpus Linguistics 2021, Limerick (online). (Slides, Video on YouTube)