Phonological Corpora of Modern Czech



Introduction

These are the web pages of the project Issues in the Phonology of Word in Czech (13-15361P), supported by the Grant Agency of the Czech Republic (2013-2015). The project's goal was to account for various aspects of the phonology of words in modern Czech. In particular, it concentrated on the phonotactics of Czech words: phoneme occurrence and frequency, phoneme combinations, and the syllabic structure of words. The project resulted in two phonological corpora of Czech, the Lexical Corpus and the Textual Corpus, and in a series of publications.

Description of the Corpora (last updated: 29/06/2016)

Abbreviations and symbols used in the Corpora and the evaluation files

Phonological Lexical Corpus

A phonologically transcribed and annotated database of the Czech vocabulary (i.e. of lemmas / dictionary entries) stored in a csv file (a comma-separated file editable e.g. in MS Excel). The transcription reflects the phonematic constituency of words and their syllabic structure ("syllabification of words"). The Corpus also includes an allophonic transcription showing an idealized pronunciation of each lexical item. The main corpus is supplemented with several smaller lexical corpora of proprial vocabulary (proper names).

At the moment, the Lexical Corpus contains:

The sources are the major dictionaries of Czech included in the Database of Glossaries (except for Výslovnost spisovné češtiny, which was added separately):

Download the Lexical Corpus (zip/csv, last updated: 29/06/2016)

Quantitative analysis of the Lexical Corpus

The whole Lexical Corpus was quantitatively evaluated for the frequencies of various phonological units, in particular within the phonological word. See the Description of the Corpora for an explanation of this notion. The evaluation was carried out with the Evaluation Program.
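As a rough illustration of what such a frequency count involves, the following Python sketch tallies phonemes across a small csv sample. The column name `phonemic`, the space-separated phoneme notation and the sample rows are assumptions made for this example only; the actual layout of the Corpus files is given in the Description of the Corpora, and the sketch is not the Evaluation Program itself.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: the real Corpus uses its own columns and symbols.
SAMPLE_CSV = """lemma,phonemic
pes,p e s
les,l e s
sen,s e n
"""


def phoneme_frequencies(csv_text, column="phonemic"):
    """Count space-separated phoneme symbols in one column of a csv text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts.update(row[column].split())
    return counts


freqs = phoneme_frequencies(SAMPLE_CSV)
total = sum(freqs.values())

# Print each phoneme with its absolute and relative frequency.
for phoneme, n in freqs.most_common():
    print(f"{phoneme}\t{n}\t{n / total:.3f}")
```

To try the same approach on a downloaded file, the sample string could be replaced by the contents of that file, with `column` set to whichever column holds the phonemic transcription.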

· Complete quantitative analysis (zip, multiple csv files)

Selected frequency tables (all values are given for phonological words):

Phonemes and allophones

Non-nuclear combinations ("consonant clusters")

Two-phoneme combinations of various types

Word structure

Phonotagm (syllable)

Proprial lexical sub-corpora

The main lexical corpus is supplemented with several subcorpora:

Names of municipalities and their parts (zip/csv, last updated: 29/06/2016)
· 15,051 names of Czech municipalities and their parts existing at the end of 2013. This sub-corpus was analyzed and described in the paper Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Corpus since the publication of the paper.

· Complete analysis (zip, multiple csv files)

Most common male and female names and their hypocoristic forms (zip/csv, last updated: 29/06/2016)
· 5,724 items

Botanical names (zip/csv, last updated: 29/06/2016)
· 2,549 items

Zoological names (zip/csv, last updated: 29/06/2016)
· 4,517 items

SSČ

Lexemes from Slovník spisovné češtiny, 4th edition, 2005 (zip/csv, last updated: 29/06/2016)

This sample (49,506 items) was analyzed and described in the papers Corpus-based analysis of the Czech syllable (2014), Kvantitativní analýza slabiky v českém lexikonu (2015), Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Corpus since the publication of the papers.

· Complete analysis (zip, multiple csv files)

Phonological Textual Corpus

The Textual Corpus consists of a selection of phonologically transcribed Czech texts stored in xml files. The texts are mostly Czech novels in the public domain (see the list of the currently included texts). As in the case of the Lexical Corpus, the transcription reflects the phonematic constituency of words (and sentences) and their syllabic structure. The Corpus also includes an allophonic transcription showing the idealized pronunciation of the sentences. In addition, the transcription takes into account the neutral prosodic organization of words within sentences, which was assigned automatically on the basis of the rules proposed by Zdena Palková for the automatic TTS synthesis of Czech. See Description of the Corpora for more details.

At the moment, the Textual Corpus contains:

Download the Textual Corpus (zip/xml, last updated: 29/06/2016)

Quantitative analysis of the Textual Corpus

The whole Textual Corpus was quantitatively evaluated for the frequencies of various phonological units. The program used for the evaluation is described under Evaluation Program below.
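The kind of count involved can be sketched in Python for two-phoneme combinations within words. The element name `w`, the attribute `phon` and the space-separated notation are hypothetical and chosen only for illustration; the real xml markup is documented in the Description of the Corpora, and the sketch does not reproduce the Evaluation Program.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: element and attribute names are illustrative only.
SAMPLE_XML = """<text>
  <s>
    <w phon="p e s"/>
    <w phon="s p i:"/>
  </s>
</text>"""


def bigram_frequencies(xml_text):
    """Count adjacent phoneme pairs inside each <w> element."""
    counts = Counter()
    root = ET.fromstring(xml_text)
    for word in root.iter("w"):
        phonemes = word.get("phon", "").split()
        # Pair each phoneme with its successor; pairs never cross word boundaries.
        counts.update(zip(phonemes, phonemes[1:]))
    return counts


bigrams = bigram_frequencies(SAMPLE_XML)
for (first, second), n in bigrams.most_common():
    print(first, second, n)
```

Because pairs are formed per word, combinations spanning a word boundary (here the final s of the first word and the initial s of the second) are deliberately not counted.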

· Complete quantitative analysis (zip, multiple csv files)

Selected frequency tables (all values are given for phonological words):

Phonemes and allophones

Non-nuclear combinations ("consonant clusters")

Two-phoneme combinations of various types

Word structure

Phonotagm (syllable)

Evaluation Program

The corpora were statistically analyzed, and can be analyzed further, with a computer program written by Renato Müller.

Download the Evaluation Program, version 1.0 (zip/exe, last updated: 29/06/2016)

In case of problems, contact us.

Publications

The following publications were written under the auspices of the project or rely on data from the Corpora:

2014

· Bičan, Aleš // Word Phonology in Czech // Czech Language News (spring 2014)

· Bičan, Aleš // K pojmu fonologické slovo v češtině // Sophia Slavica, Sborník prací věnovaných PhDr. Žofii Šarapatkové k osmdesátým narozeninám (eds. Vít Boček - Bohumil Vykypěl), Brno: Tribun 2014, pp. 13-23 // download

· Bičan, Aleš // Nuclearity of /r/ and /l/ in Czech // New Insights into Slavic Linguistics (eds. Jacek Witkoś - Sylwester Jaworski), Frankfurt am Main: Peter Lang 2014, pp. 21-33 // download // syllabicity test (referred to in the paper)

2015

· Bičan, Aleš // Distribution of vocalic quantity in Czech // Grazer Linguistische Studien 83, 2015, pp. 133-138 // download // supplementary data (referred to in the paper)

· Bičan, Aleš // Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí // Slovo a slovesnost 76/4, 2015, pp. 243-264 // download // Appendix 1 // Appendix 2 // Appendix 3 // Appendix 4 (see also the analysis of the corpus above)

· Bičanová, Lenka - Bičan, Aleš // Nástin typologie fonologických změn na úrovni slova // Linguistica Brunensia 63/2, 2015, pp. 7-25 // download

· Bičan, Aleš // Kvantitativní analýza slabiky v českém lexikonu // Linguistica Brunensia 63/2, 2015, pp. 87-107 // download

· Bičan, Aleš // Corpus-based analysis of the Czech syllable // Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 18 (eds. Gutiérrez Rubio, Enrique et al.), München - Berlin - Washington: Harrassowitz Verlag 2015, pp. 26-36 // download

· Bičan, Aleš // Fonologický lexikální korpus češtiny a slabičná struktura českého slova // Bohemica Olomucensia 7/3-4, 2015, pp. 45-59 // download // supplementary data

Contact

The project was carried out and these pages are maintained by Aleš Bičan of the Institute of the Czech Language of the Academy of Sciences of the Czech Republic.

Aleš Bičan
Ústav pro jazyk český AV ČR, v. v. i.
Veveří 97
60200 Brno
Czech Republic

email: bican@phil.muni.cz