Search EUROPARL corpus with simple query.

Online Help

[separator]

The Europarl CQP Demo interface uses two browser windows. The main window is split into a control frame at the top (where you can enter a corpus query and set some options) and a larger result frame (where query results are displayed). A separate context window will pop up when the links for extended match context are clicked; the context window is also used for some other auxiliary information. If you run multiple CQP Demo sessions in parallel, they will share the same context window.

The control frame: Type a simple query into the text field at the top left of the frame, and select a language with the popup menu next to it. Use the menus below the text field to match either word forms or lemmata (base forms), and to enable case-insensitive search (optionally, accents can also be ignored). The sort order option determines the ordering of query results: unsorted (i.e. in corpus order), randomised, or in various lexical orderings (ascending, descending, reverse). The lexical orderings can be based on word forms, lemmata (base forms), or part-of-speech tags. The default normalised sort order (ignoring case and accents) can be deactivated with the checkbox to the right. Press the Run Query button to execute the query. Note that result sets are limited in size: the corpus search will stop after the first 50,000 matches. The display options and the buttons for frequency distributions are described below.

The query language: This Web interface uses the standard CEQL syntax for simple queries (see the examples below for a quick start). CEQL queries are entered in an intuitive notation which is translated into CQP syntax behind the scenes (you can view the generated CQP code at the bottom of the result page). Type any word form to find occurrences of this word in the corpus (some non-alphabetic characters with special meaning need to be “escaped” with a backslash, especially \? to match a question mark). You can use shell-like wildcards: ? stands for an arbitrary character, * for a substring of arbitrary length (possibly empty), and + for a nonempty substring (e.g. +ized to find word forms ending in -ized). Comma-separated lists of alternatives are enclosed in square brackets, e.g. [over,under]+ for words beginning with over- or under-. The mode setting to the right of the query text frame determines whether case and/or accents are ignored. With the default case-insensitive setting, add the modifier :C to a query word in order to match a specific case (e.g. i:C to find the letter i rather than the first-person pronoun I).
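
The wildcard notation described above can be pictured as a translation into regular expressions. The following Python sketch is only an illustration of that idea and not the code used by this interface: the function names are hypothetical, the ignore_case parameter stands in for the mode setting, and the :C modifier is not handled. The CQP code actually generated for a query is shown at the bottom of the result page.

```python
import re


def wildcards_to_regex(pattern: str) -> str:
    """Translate a simple wildcard pattern into a regular expression string.

    Hypothetical helper: ? = one character, * = any substring,
    + = nonempty substring, [a,b] = alternatives, \\x = literal x.
    """
    out = []
    in_alternatives = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Backslash-escaped character is matched literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            out.append(".")
        elif ch == "*":
            out.append(".*")
        elif ch == "+":
            out.append(".+")
        elif ch == "[":
            out.append("(")
            in_alternatives = True
        elif ch == "]":
            out.append(")")
            in_alternatives = False
        elif ch == "," and in_alternatives:
            out.append("|")
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def word_matches(pattern: str, word: str, ignore_case: bool = True) -> bool:
    """Check whether a whole word form matches the wildcard pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(wildcards_to_regex(pattern), word, flags) is not None


print(wildcards_to_regex("[over,under]+"))   # (over|under).+
print(word_matches("+ized", "realized"))     # True
print(word_matches("+ized", "ized"))         # False: + needs a nonempty prefix
```

Matching is done against the whole word form, which is why +ized does not match the bare string ized.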

In order to match a lemma (= base form) instead of a specific inflected form, enclose the word in curly braces: {make} finds the word forms make, makes, made, making, etc. If you want to distinguish between different readings of a word – e.g. light as noun, verb or adjective – you can add a part-of-speech tag separated by an underscore: light_JJ* matches the adjective light only (note that wildcards are allowed for POS tags, too). See the links below for the tagset used in each language. A simplified, universal tagset can be accessed by enclosing the tag in curly braces, e.g. light_{A} for the adjective reading. POS tags can also be combined with lemma queries: type {light}_{V} to find all inflected forms of the verb to light.
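
To make the notation for a single token easier to see, here is a small Python sketch that splits an expression such as {light}_{V} into its parts. It is an illustrative assumption about how the notation decomposes, not the parser used by this interface; the TokenSpec class and parse_token function are hypothetical names, and escaped underscores inside a word are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenSpec:
    text: Optional[str]   # word form or lemma pattern; None if only a tag was given
    is_lemma: bool        # True if the text was enclosed in curly braces
    tag: Optional[str]    # part-of-speech tag pattern; None if no tag was given
    is_simple_tag: bool   # True if the tag was enclosed in curly braces


def parse_token(expr: str) -> TokenSpec:
    """Split one token expression into word/lemma part and POS tag part."""
    head, sep, tail = expr.rpartition("_")
    if sep:
        text, tag = head, tail
    else:
        # No underscore: the whole expression is a word form or lemma.
        text, tag = expr, ""
    is_lemma = text.startswith("{") and text.endswith("}")
    if is_lemma:
        text = text[1:-1]
    is_simple_tag = tag.startswith("{") and tag.endswith("}")
    if is_simple_tag:
        tag = tag[1:-1]
    return TokenSpec(text or None, is_lemma, tag or None, is_simple_tag)


for query in ["{make}", "light_JJ*", "light_{A}", "{light}_{V}"]:
    print(query, "->", parse_token(query))
```

The four queries in the loop are the examples from the paragraph above: a plain lemma, a word form with a language-specific tag, a word form with a simplified tag, and a lemma with a simplified tag.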

Entering several words separated by blanks will match the corresponding sequence in the corpus. In such a sequence, a single + represents an arbitrary word and a single * an optional word. Multiple + and * can be grouped together, e.g. ++*** to skip 2 to 5 arbitrary words. If you want to match a word by its part-of-speech tag only, you can simply omit the word form part and start with an underscore: _JJS matches a superlative, and _{PRON} any pronoun. Parts of a sequence can be repeated or made optional by enclosing them in parentheses followed by a repetition marker: (…)? = optional, (…)+ = one or more repetitions, (…)* = zero or more repetitions. These constructions can be combined and nested to form very complex patterns: _{PREP} (_{ART})? (_{A})* {time}_{N} finds prepositional phrases with the head noun time. Use XML tags to match the start (<s>) or end (</s>) of a sentence, e.g. <s> but for sentences beginning with the word but.
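
The arithmetic behind grouped + and * wildcards is simple: every + stands for one obligatory word and every * for one optional word. The Python sketch below (the function name skip_range is a hypothetical helper, not part of the interface) computes the resulting range of skipped words.

```python
def skip_range(wildcards: str) -> tuple[int, int]:
    """Return (minimum, maximum) number of words skipped by a group like ++***."""
    if not wildcards or set(wildcards) - {"+", "*"}:
        raise ValueError("expected a nonempty group of + and * only")
    minimum = wildcards.count("+")   # each + is one obligatory word
    maximum = len(wildcards)         # each * adds one optional word on top
    return minimum, maximum


print(skip_range("++***"))   # (2, 5)
```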

Proximity queries: A special type of query can be used to find co-occurrences of particular words within a sentence or specified window. Connect two words (which may include wildcards, as well as lemma and/or POS tag) with <<s>> to find them in the same sentence, or <<5>> to find them within a distance of at most 5 tokens (including punctuation) – enter {cat} <<3>> {dog} to find out whether delegates talk about cats and dogs. The one-sided variant {cat} <<3<< {dog} only looks to the left of the first word: it finds dogs and cats, but not cats and dogs. Note that in each case only the first word specified in the query will be highlighted on the result page! Multiple proximity expressions can be combined, using parentheses (…) to indicate the desired nesting, e.g. {waste}_{V} <<s>> (time <<3>> money). Keep in mind that the sequence patterns above ((…)? etc.) are not compatible with proximity queries.
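
The difference between the two-sided and the one-sided window can be illustrated with a short Python sketch over a plain list of tokens. This is not how the corpus search is implemented; the function proximity_hits and its parameters are hypothetical, matching is by exact string rather than lemma, and the convention that an adjacent token counts as distance 1 is an assumption. Like the result page, the sketch reports only the positions of the first word.

```python
def proximity_hits(tokens, first, second, distance, direction="both"):
    """Return positions of `first` that have `second` within `distance` tokens.

    direction: "both" (like <<n>>) or "left" (like <<n<<, second before first).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        start = max(i - distance, 0)
        end = i - 1 if direction == "left" else min(i + distance, len(tokens) - 1)
        # Look for the second word anywhere in the window, except at position i.
        if any(j != i and tokens[j] == second for j in range(start, end + 1)):
            hits.append(i)
    return hits


a = "dog and cat".split()
b = "cat and dog".split()
print(proximity_hits(a, "cat", "dog", 3))           # [2]
print(proximity_hits(b, "cat", "dog", 3))           # [0]
print(proximity_hits(a, "cat", "dog", 3, "left"))   # [2]
print(proximity_hits(b, "cat", "dog", 3, "left"))   # []
```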

Tagset information: English, German, French, Spanish
Simple tags: N = noun, V = verb, A = adjective, ART = article, ADV = adverb, PREP = preposition, CONJ = conjunction, PRON = pronoun, $ = punctuation

[separator]

The result frame: When a query has been executed, the matching strings are displayed together with some context (by default a complete sentence) and the aligned sentence in each selected language. The matching string itself is printed in bold face and highlighted with a yellow background. Every match is preceded by a header line showing the match number followed by date and speaker information (if available). Click on the context link in the left margin to display a larger amount of context in the context window, again showing alignments for all selected languages. If there are more than 20 matches, they are displayed in pages of 20 items each. The navigation bars at the top left and bottom left allow you to step through the individual pages. Click the << and >> buttons to jump back or forward by an entire page (20 matches), or < and > to jump back or forward by half a page (10 matches). Click on |< to go back to the first page and >| to jump to the last page. You can also select a page from the drop-down menu in the middle of the navigation bar and jump directly to this page by clicking the Go button.

The display options allow you to customise the information shown in the result frame. Note that changes in the display options only take effect when the query is re-run (query results are cached, so they can be re-displayed immediately). Alternatively, you can set display options using the menus in the top right or bottom right corner of the result frame and activate the new settings by clicking the Apply button (changes made here will be undone when a new query is executed). The leftmost display menu selects the information shown for tokens, allowing a choice of word forms with or without part-of-speech tags as subscripts, and base forms (lemma) with POS tags. The middle menu determines the amount of context shown around matches in the source language. By default, the context consists of the full sentence containing the match (sentence). It can be extended to include the preceding and following sentence (2 sentences) or a total of ten sentences (10 sentences). Alternatively, complete paragraphs or speaker turns can be displayed. The context choice does not affect target languages, which will always display the smallest unit aligned to each match. The checkboxes on the right select languages for display. Note that the current search language is always activated implicitly.

[separator]

Match frequencies: If you click the Frequencies button instead of Run Query, a list of distinct query matches and their corpus frequencies will be displayed in the result frame. The list respects the currently selected sort order options; in particular, matching strings will be normalised if the sort normalised box is checked. Note that it is also possible to count lemmata or even (patterns of) POS tags in this way. Click on any of the strings to show the corresponding matches in the context window (you will find a navigation bar and display options at the bottom of this page).
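
The effect of normalisation on a frequency list can be sketched in Python as follows. This is only an approximation of the idea (ignoring case and accents), not the interface's own code; the function names and the sample strings are hypothetical.

```python
import unicodedata
from collections import Counter


def normalise(text: str) -> str:
    """Fold case and strip accents (combining marks) from a string."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.casefold()


def frequency_list(matches, normalised=True):
    """Count distinct matching strings, most frequent first."""
    keys = (normalise(m) if normalised else m for m in matches)
    return Counter(keys).most_common()


sample = ["Café", "cafe", "CAFE", "tea"]          # hypothetical query matches
print(frequency_list(sample))                     # [('cafe', 3), ('tea', 1)]
print(len(frequency_list(sample, normalised=False)))   # 4 distinct strings
```

With normalisation switched off, the three spellings of the first word are counted as separate entries.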

Corpus distribution: Click on the Distribution button to show the distribution of query matches across years in the left part of the result frame, and the distribution according to the tongue of the respective speaker in the right part. Note that speaker tongue is often unspecified, so it is difficult to make use of this information. The bars in the left part are scaled to account for the slightly different number of tokens in each year. The blue percentage value at the end of a bar compares the relative frequency in a year to the average, so values above 100% indicate above-average frequency (search for coffee to see a striking example). Clicking on any one of the labels will switch the result frame to show the corresponding matches. Press your browser's Back button to return to the distribution window.
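
One plausible way to compute such a percentage is to divide the relative frequency in a year by the relative frequency over all years. The Python sketch below works under that assumption; the exact definition of the average used by the interface is not stated here, and the function name and the counts in the example are invented for illustration.

```python
def relative_to_average(matches_by_year, tokens_by_year):
    """Return, per year, the relative frequency as a percentage of the overall one."""
    total_matches = sum(matches_by_year.values())
    total_tokens = sum(tokens_by_year[year] for year in matches_by_year)
    overall = total_matches / total_tokens   # relative frequency across all years
    return {
        year: 100.0 * (matches_by_year[year] / tokens_by_year[year]) / overall
        for year in matches_by_year
    }


# Hypothetical counts, not taken from the corpus.
matches = {1999: 10, 2000: 30}
tokens = {1999: 1000, 2000: 1000}
for year, percent in relative_to_average(matches, tokens).items():
    print(year, f"{percent:.0f}%")
```

In this made-up example the second year has three times as many matches in the same number of tokens, so it comes out above 100% and the first year below.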

[separator]

Example queries: You can copy & paste these queries into the text field in the top left corner of the control frame.

English:

    Mr* President

    _{A} energy

    [over,under]*ion

    from ( _{ART} )? ( _{A} )* ( _{N} ){1,2} to ( _{ART} )? ( _{A} )* _{N}
    

German:

    Gesetz

    [Kern,Atom]kraft+

    neben ( _{ART})? ( ( _{ADV} )? _{A} )* _{N}

    von *s_{N} wegen

    

[separator]