Phonological Corpus of Czech

1 Introduction

The Phonological Corpus of Czech is a bundle of phonologically transcribed databases (sub-corpora) of various Modern Czech word lists and texts. It was originally developed in 2013–2015 under the auspices of the project Issues in the Phonology of Word in Czech (13-15361P) of the Grant Agency of the Czech Republic. The first version was published online in 2016 together with its quantitative analysis (phoneme frequency, phoneme combinations, and the syllabic structure of words; see Sections 2.2 and 3.1). The Lexical Sub-corpus has since been corrected and enlarged (see Sections 2.5 and 2.6). A new version will be made public once it reaches a more definitive form.

Description of the Corpus (last updated: 28/07/2020)

Abbreviations and symbols used in the Corpus and the evaluation files

2 Lexical Sub-corpus

The Lexical Sub-corpus is a phonologically transcribed and annotated database of the Czech vocabulary (i.e. of lemmas) stored in a csv file (a comma-separated format that can be edited in, e.g., MS Excel). The transcription reflects the phonematic constituency of words and their syllabic structure ("syllabification"). The Corpus also includes an allophonic transcription showing an idealized pronunciation of a given lexical item. The main corpus is supplemented with several smaller lexical corpora (Section 2.3).
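As a rough illustration of working with such a file, the sketch below reads comma-separated rows with Python's standard csv module and counts the syllables of each item. The column names (lemma, phonemic, syllabified), the "." syllable separator, and the two sample rows are assumptions made for this example only; they are not taken from the Corpus, so check the header of the downloaded file and the Description of the Corpus first.

```python
import csv
import io

# Illustrative stand-in for a downloaded csv file. The column names, the "."
# syllable separator, and the rows are hypothetical, not copied from the Corpus.
SAMPLE = """lemma,phonemic,syllabified
voda,v o d a,vo.da
strom,s t r o m,strom
"""


def load_lexicon(handle):
    """Return the rows of a comma-separated lexicon file as dictionaries."""
    return list(csv.DictReader(handle))


rows = load_lexicon(io.StringIO(SAMPLE))

for row in rows:
    # Split the (assumed) syllabified transcription on the assumed separator.
    syllables = row["syllabified"].split(".")
    print(row["lemma"], len(syllables))
```

For a real file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"); the encoding is likewise an assumption.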

The Lexical Sub-corpus consists of two parts: the main Dictionary Database (words from major dictionaries) and the additional Terminological Databases (word lists of proper names and terms). Misprints and transcription errors in both parts have been corrected continuously since 2016. For the further development of the Lexical Sub-corpus, see Sections 2.5 and 2.6.

2.1 The Dictionary Database, version 2016

  • 275,779 lexical items // 288,243 phonological words (279,826 unique orthographical words)
  • 1,029,536 phonotagms (syllables)
  • 2,570,372 phonemes
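The totals above can be turned into rough averages. The following sketch only divides the published figures by one another; it assumes that the phonotagm and phoneme totals are counted over the phonological words, which the list does not state explicitly, and the resulting averages are derived values rather than figures reported by the Corpus.

```python
# Totals quoted in the list above (Dictionary Database, version 2016).
PHONOLOGICAL_WORDS = 288_243
PHONOTAGMS = 1_029_536
PHONEMES = 2_570_372

# Assumption: all three totals refer to the same set of phonological words.
phonotagms_per_word = PHONOTAGMS / PHONOLOGICAL_WORDS
phonemes_per_phonotagm = PHONEMES / PHONOTAGMS
phonemes_per_word = PHONEMES / PHONOLOGICAL_WORDS

print(f"{phonotagms_per_word:.2f} phonotagms per phonological word")
print(f"{phonemes_per_phonotagm:.2f} phonemes per phonotagm")
print(f"{phonemes_per_word:.2f} phonemes per phonological word")
```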

The lexical items were taken from the following major dictionaries of Czech included in the Database of Glossaries (except for Výslovnost spisovné češtiny, which was added separately):

  • Příruční slovník jazyka českého (PSJČ; 1935–1957)
  • Slovník spisovného jazyka českého (SSJČ; 2nd edition, 1989)
  • Slovník spisovné češtiny (SSČ; 4th edition, 2005)
  • Co v slovnících nenajdete (Novinky v současné slovní zásobě) (CSN; 1994)
  • Nová slova v češtině. Slovník neologizmů 1, 2 (SN; 1998, 2004)
  • Slovesa pro praxi. Valenční slovník nejčastějších českých sloves (1997)
  • Slovník slovesných, substantivních a adjektivních vazeb a spojení (2005)
  • Frekvenční slovník češtiny (FSČ; 2004)
  • Akademický slovník cizích slov A-Ž (NASCS; 1995)
  • Výslovnost spisovné češtiny (VSČ2; 1978)

Download the Dictionary Database, Version 2016 (zip/csv, uploaded: 29/06/2016)

2.2 Quantitative analysis of the Dictionary Database

The Dictionary Database was quantitatively analyzed for frequencies of various phonological units, in particular the phonological word. See the Description of the Corpus for the explanation of this notion.
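To give an idea of what such frequency counts involve, here is a minimal sketch that tallies single phonemes and adjacent two-phoneme combinations with collections.Counter. It is not the procedure used for the published analysis: the input format (each word as a list of phoneme symbols) and the two sample words are assumptions for illustration, and the Corpus distinguishes several types of combinations that this sketch does not.

```python
from collections import Counter


def phoneme_counts(words):
    """Count how often each phoneme symbol occurs across all words."""
    counts = Counter()
    for phonemes in words:
        counts.update(phonemes)
    return counts


def bigram_counts(words):
    """Count adjacent two-phoneme combinations within each word."""
    counts = Counter()
    for phonemes in words:
        # zip pairs every phoneme with its right-hand neighbour.
        counts.update(zip(phonemes, phonemes[1:]))
    return counts


# Hypothetical input: two words given as lists of phoneme symbols.
words = [["v", "o", "d", "a"], ["d", "o", "m"]]

single = phoneme_counts(words)
pairs = bigram_counts(words)

print(single.most_common())
print(pairs.most_common())
```

Keeping the word boundaries (counting pairs only inside a word) is a design choice of this sketch; counting across word boundaries would need the words joined in running-text order.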

· Complete quantitative analysis (zip, multiple csv files)

Selected frequency tables (all values apply to phonological words):

Phonemes and allophones

Non-nuclear combinations ("consonant clusters")

Two-phoneme combinations of various types

Word structure

Phonotagm (syllable)

2.3 Additional Terminological Databases, version 2016

Names of municipalities and their parts (zip/csv, version 2016, uploaded: 29/06/2016)
· 15,051 names of Czech municipalities and their parts existing at the end of 2013; the database was analyzed and described in the paper Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Corpus since the publication of the paper.

· Complete analysis (zip, multiple csv files)

Most common male and female names and their hypocoristic forms (zip/csv, version 2016, uploaded: 29/06/2016)
· 5,724 items

Botanical names (zip/csv, version 2016, uploaded: 29/06/2016)
· 2,549 items

Zoological names (zip/csv, version 2016, uploaded: 29/06/2016)
· 4,517 items

2.4 SSČ

Lexemes from Slovník spisovné češtiny, 4th edition, 2005 (zip/csv, uploaded: 29/06/2016)

This sample (49,506 items) was analyzed and described in the papers Corpus-based analysis of the Czech syllable (2014), Kvantitativní analýza slabiky v českém lexikonu (2015), and Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Database since the publication of the papers.

· Complete analysis (zip, multiple csv files)

2.5 Lexical Sub-corpus, version 2018

A major update in 2018 resulted in Version 2018, which remains unpublished but was partly analyzed in Bičan (2020a, b, c) and Bičan et al. (2020); see Publications.

New databases:

  • Most common surnames (13,021 items)
  • Names of castles and chateaus (1,211 items)
  • Word forms excerpted from the Czech National Corpus, SYN (17,955 items)
  • Recent English loanwords not recorded in any dictionary at that time (392 items)

The vocabulary of Slovník spisovné češtiny (2005) was further divided into the Native Word Database (33,966 items) and the Loanword Database (36,567 items). The latter was supplemented by The Phonological Database of Czech Anglicisms (Bičan et al. 2020) and by the loanwords recorded in the loanword dictionaries NASCS and VSČ2 as well as in the Czech National Corpus (for additional details, see Bičan 2020c).

2.6 Lexical Sub-corpus, version 2020

The 2018 version was made obsolete by another major update in 2020, prepared for the monograph Slabika a její hranice v češtině [Syllable and its boundary in Czech] (Šturm - Bičan 2021; see Publications). The book works out a new, complete set of syllabification rules for Czech, which have replaced the rules previously used for the Corpus. New databases have also been included. In total, Version 2020 consists of 461,792 unique words.

New databases:

  • Anoikonyms included in the book Geografická jména České republiky, Prague 2016 (2,310 items)
  • Adjectives included in the book Geografická jména České republiky (Prague 2016) were added to the database of the Names of municipalities and their parts (now 16,614 items)
  • Additional word forms from the Czech National Corpus, SYN2010, from the database used by Šturm, P., & Lukeš, D. 2017. "Fonotaktická analýza obsahu slabik na okrajích českých slov v mluvené a psané řeči". Slovo a slovesnost 78(2), 99–118 (approx. 2,500 items)

3 Textual Sub-corpus

The Textual Sub-corpus consists of a selection of phonologically transcribed Czech texts stored in xml files. The texts are mostly Czech novels in the public domain (see the list of the currently included texts). As in the case of the Lexical Sub-corpus, the transcription reflects the phonematic constituency of words (and sentences) and their syllabic structure. The Sub-corpus also includes an allophonic transcription showing the idealized pronunciation of the sentences. In addition, the transcription takes into account the neutral prosodic organization of words within sentences. This prosodic organization was assigned automatically on the basis of the rules proposed by Zdena Palková for the automatic TTS (text-to-speech) synthesis of Czech. See Description of the Corpus for more details.
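A transcribed text of this kind could be traversed along the following lines with Python's standard xml.etree.ElementTree module. The markup in the sketch is entirely hypothetical: the element names (text, s, w), the attributes (orth, phon), the "." syllable separator, and the placeholder strings are invented for illustration and should be replaced by the structure documented in the Description of the Corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element names, attribute names, and values are invented
# for this example and do not reproduce the real files or their transcription.
SAMPLE = """<text>
  <s>
    <w orth="voda" phon="vo.da"/>
    <w orth="teče" phon="te.če"/>
  </s>
</text>"""

root = ET.fromstring(SAMPLE)

# Collect (orthographic form, transcription) pairs for every word element.
tokens = [(w.get("orth"), w.get("phon")) for w in root.iter("w")]

# Count syllables by splitting on the assumed "." separator.
syllable_total = sum(len(phon.split(".")) for _, phon in tokens)

print(len(tokens), "word tokens,", syllable_total, "syllables")
```

For files stored on disk, ET.parse(path).getroot() would take the place of ET.fromstring.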

The Sub-corpus has not been updated since 2016.

Version 2016 contains:

  • 67 text files
  • 3,202,717 orthographic word tokens // 2,514,821 phonological word tokens (194,221 unique orthographical words // 385,570 unique phonological words)
  • 6,211,321 phonotagms (syllables)
  • 15,135,297 phonemes

Download the Textual Sub-corpus (zip/xml, version 2016, uploaded: 29/06/2016)

3.1 Quantitative analysis of the Textual Sub-corpus

The Textual Sub-corpus was quantitatively analyzed for frequencies of various phonological units.

· Complete quantitative analysis (zip, multiple csv files)

Selected frequency tables (all values apply to phonological words):

Phonemes and allophones

Non-nuclear combinations ("consonant clusters")

Two-phoneme combinations of various types

Word structure

Phonotagm (syllable)

4 Publications

5 Contact

Aleš Bičan
Ústav pro jazyk český AV ČR, v. v. i. // Czech Language Institute, Academy of Sciences of the Czech Republic
Veveří 97
60200 Brno
Czech Republic

email: bican@phil.muni.cz