The Corpus of Czech verse (CCV, henceforth) is a lemmatized, phonetically, morphologically, metrically and strophically annotated corpus.*
Each lexical unit is annotated with its basic word form (lemma), phonetic transcription and grammatical categories; each verse line is annotated with its type of metre (iamb, trochee, etc.), length (number of feet), type of line ending (masculine, feminine, etc.) and metrical pattern. (Currently, only syllabotonic verse lines are metrically annotated.) At higher levels, rhyme pairs (or larger rhyme groups) and fixed forms (sonnet, rondel, etc.) are annotated. The metrical and strophic annotation can be searched through the Database of Czech Metres; the lemmatization level is partly accessible through the Frequency lists; rhyme pairs can be searched in the Gunstick application.
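The annotation levels described above can be pictured as nested records. The following is only an illustrative sketch: the class names, field names and value formats (`Token`, `VerseLine`, the `"SWSW…"`-style pattern string, the tag value) are assumptions made for this example and do not reflect the actual storage format of CCV.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    """One lexical unit with its word-level annotation (hypothetical fields)."""
    form: str       # word form as it appears in the text
    lemma: str      # basic word form
    phonetic: str   # phonetic transcription (placeholder format)
    tag: str        # grammatical categories (placeholder format)


@dataclass
class VerseLine:
    """One verse line with its line-level annotation (hypothetical fields)."""
    tokens: List[Token] = field(default_factory=list)
    metre: Optional[str] = None     # e.g. "iamb", "trochee"; None if not annotated
    feet: Optional[int] = None      # length in feet
    ending: Optional[str] = None    # e.g. "masculine", "feminine"
    pattern: Optional[str] = None   # metrical pattern, here S = strong, W = weak

    def is_metrically_annotated(self) -> bool:
        # Only syllabotonic lines carry metrical annotation, so the
        # metrical fields may be absent for other lines.
        return self.metre is not None


def lemmas(line: VerseLine) -> List[str]:
    """Return the lemmas of a line in text order."""
    return [token.lemma for token in line.tokens]
```

Keeping the metrical fields optional mirrors the remark that only syllabotonic lines are currently annotated for metre; rhyme and fixed-form annotation would sit on a higher level (stanza or poem) and is left out of this sketch.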
CCV is based on texts from the Czech electronic library, which, however, contains a number of duplicates (i.e. poems that recur in different editions of a collection or in an author's collected works). To avoid distorting the statistical data, we decided to include in CCV only the oldest occurrence of each poem (see the inventory of discarded poems); correspondence between poems is determined on the basis of their phonetic transcription. As a result, the selection is not affected by variations in punctuation, while reprints in which changes (even minor ones) were made are not eliminated.
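The selection rule just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build CCV: the `Poem` record, its fields and the use of the publication year to decide which occurrence is "oldest" are all hypothetical, and the phonetic transcription is assumed to be already available as one string per verse line.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Poem:
    """Hypothetical record for one occurrence of a poem in one edition."""
    poem_id: str
    year: int                        # publication year of the edition (assumed)
    phonetic_lines: Tuple[str, ...]  # phonetic transcription, one string per line


def select_oldest(poems: Iterable[Poem]) -> Tuple[List[Poem], List[Poem]]:
    """Keep the oldest occurrence of each poem; return (kept, discarded).

    Two occurrences count as the same poem when their phonetic
    transcriptions are identical, so differences in punctuation are
    ignored, while any change in wording yields a distinct poem.
    """
    kept: Dict[Tuple[str, ...], Poem] = {}
    discarded: List[Poem] = []
    # sorted() is stable, so occurrences from the same year keep input order.
    for poem in sorted(poems, key=lambda p: p.year):
        key = poem.phonetic_lines
        if key in kept:
            discarded.append(poem)   # a later reprint of an already kept poem
        else:
            kept[key] = poem         # first (oldest) occurrence
    return list(kept.values()), discarded
```

Using the whole transcription as the dictionary key means that a reprint with even a one-word change produces a different key and is therefore retained, which matches the behaviour described above.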
* Lemmatization and morphological annotation were carried out by researchers at the Institute of Theoretical and Computational Linguistics FA CU (Hana Skoumalová, Milena Hnátková, Tomáš Jelínek and Vladimír Petkevič) in cooperation with researchers at the Institute of Formal and Applied Linguistics FMP CU (Jan Hajič, Jaroslava Hlaváčová).
The basic characteristics of the Corpus of Czech verse
- 1 689 poetry collections
- 76 699 poems
- 2 664 989 verse lines
- 14 592 037 words