CLTS


Establishing a Cross-Linguistic Database of Phonetic Notation Systems

Agenda 2017

  • Comparability of Cross-Linguistic Data
  • Phonetic Transcription Systems
  • Cross-Linguistic Data Formats
  • Cross-Linguistic Transcription Systems

Comparability of Linguistic Data


Background

  • linguistics is a data-driven discipline
  • every linguist out there profits or has profited from colleagues sharing their data
  • we still mess it up oftentimes

Comparability of Linguistic Data

Examples

  • grammatical description and comparison
  • phonetic description and historical comparison
  • lexical description and lexical comparison

Comparability of Linguistic Data

Problems

  • availability
  • transparency
  • comparability

Comparability of Linguistic Data

Problems: Availability

Be careful, this presentation may contain fake news!

Comparability of Linguistic Data

Problems: Transparency

Taken from a blog by Bengtson (2017) at http://euskararenjatorria.net/?p=26071

Comparability of Linguistic Data

Problems: Comparability

Comparing different reconstruction systems for Old Chinese (List et al. 2017)

Phonetic Transcription Systems


Background

Phonetic Transcription Systems

IPA

  • "IPA inside" may mean different things
  • "IPA" itself creates ambiguities
  • "IPA" is not a standard but a set of suggestions, as it does not provide evaluation tools
  • "IPA" suggestions are ignored by linguists in multiple ways
  • see Moran and Cysouw (2017) for details on Unicode and IPA pitfalls...

Phonetic Transcription Systems

Comparative Databases

  • Wikipedia has a large collection of language phonologies in which the IPA is supposedly used, but which are not formally tested
  • field workers all over the world produce data in transcriptions that are supposed to conform to the IPA, but that differ widely in how strictly they adhere to it
  • the ASJP project designed a short alphabet to gather rough transcriptions of lexical items of different languages of the world
  • Fonetikode (Dediu and Moisik 2016) is an attempt to link Ruhlen's and Phoible's sounds to a new feature system, but it does not use the original Ruhlen data and does not provide annotations for all symbols

Phonetic Transcription Systems

Comparative Databases

Dataset                       Transcr. Syst.      Sounds
GLD (Ruhlen 2008)             NAPA (modified)     600+ (?)
Phoible (Moran et al. 2015)   IPA (specified)     2000+
GLD (Starostin 2015)          UTS                 ?
ASJP (Wichmann et al. 2016)   ASJP Code           700+
PBase (Mielke 2008)           IPA (specified)     1000+
Wikipedia                     IPA (unspecified)   ?
JIPA                          IPA (norm?)         800+

Cross-Linguistic Data Formats


Background

The Cross-Linguistic Data Formats initiative (Forkel et al. 2016, http://cldf.clld.org) comes along with:

  • standardization efforts (linguistic meta-data-bases like Glottolog, Concepticon, and CLTS)
  • software APIs which help to test and use the data
  • working examples for best practice

Cross-Linguistic Data Formats

Technical Aspects

  • See http://github.com/glottobank/cldf for details, discussions, and working examples.
  • Format for machine-readable specification is CSV with metadata in JSON, following the W3C’s Model for Tabular Data and Metadata on the Web (http://www.w3.org/TR/tabular-data-model/).
  • CLDF ontology builds and expands upon the General Ontology for Linguistic Description (GOLD).
  • the pycldf API in Python is close to its first release and can be used to test whether datasets conform to CLDF
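
The "CSV with metadata in JSON" idea can be illustrated with a minimal sketch. It uses only the Python standard library and is not the pycldf API. The column names, the file name, and the word forms are hypothetical; only the metadata keys (`url`, `tableSchema`, `columns`, `name`, `required`) are taken from the W3C tabular data model.

```python
import csv
import io
import json

# Hypothetical metadata in the style of the W3C tabular data model:
# it names the CSV file and declares its columns.
METADATA = json.loads("""
{
  "url": "forms.csv",
  "tableSchema": {
    "columns": [
      {"name": "ID", "required": true},
      {"name": "Language_ID", "required": true},
      {"name": "Concept", "required": true},
      {"name": "Form", "required": true},
      {"name": "Comment", "required": false}
    ]
  }
}
""")

# Hypothetical wordlist data; a real dataset would be read from the
# file named in METADATA["url"].
FORMS_CSV = """ID,Language_ID,Concept,Form,Comment
1,german,hand,hant,
2,english,hand,hænd,
3,french,hand,,missing form
"""


def validate(csv_text, metadata):
    """Return a list of human-readable problems found in the CSV text."""
    columns = metadata["tableSchema"]["columns"]
    declared = [col["name"] for col in columns]
    required = [col["name"] for col in columns if col.get("required")]
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # The header row must list exactly the declared columns, in order.
    if reader.fieldnames != declared:
        problems.append("header does not match declared columns")
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        for name in required:
            if not row.get(name):
                problems.append(f"line {line_no}: empty required column {name}")
    return problems


PROBLEMS = validate(FORMS_CSV, METADATA)
for problem in PROBLEMS:
    print(problem)
```

Keeping the column declarations in a separate JSON file means the same check can be run against any dataset that ships its own metadata.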

Cross-Linguistic Data Formats

Standards

  • Wordlist standard (integrated into various tools like LingPy, Beastling, and EDICTOR)
  • Dictionary standard (will be the basis for the Dictionaria project, http://dictionaria.clld.org)
  • Feature standard (basic ways to handle grammatical features in cross-linguistic datasets)

Cross-Linguistic Data Formats

Meta-Data-Bases

  • Concepticon (List et al. 2016) handles concepts across different datasets and questionnaires.
  • Glottolog (Hammarström et al. 2017) helps to handle languages via unique identifiers.
  • CLTS (this talk) is meant to provide the missing standard for handling phonetic transcription systems by assigning unique identifiers to the distinct sounds found across linguistic datasets

Cross-Linguistic Transcriptions


Objectives

  • provide a standard for phonetic transcription for the purpose of cross-linguistic studies
  • standardized ways to represent sound values serve as "comparative concepts" in the sense of Haspelmath (2010)
  • similar to the Concepticon, we want to allow different transcription systems to be registered and to link them with each other by mapping each system to unique sound segments
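
A minimal sketch of this linking idea: two transcription systems are registered as mappings from symbols to shared sound identifiers, and translation goes through the identifier. The three sounds, the identifier wording, and the function name `translate` are illustrative assumptions, not the CLTS data or API.

```python
# Each transcription system maps its own symbols to a shared identifier
# (here a feature description used as a plain string).
BIPA = {
    "ŋ": "voiced velar nasal consonant",
    "ʃ": "voiceless post-alveolar fricative consonant",
    "ʔ": "voiceless glottal stop consonant",
}
ASJP = {
    "N": "voiced velar nasal consonant",
    "S": "voiceless post-alveolar fricative consonant",
    "7": "voiceless glottal stop consonant",
}


def translate(symbols, source, target):
    """Translate symbols from one system to another via shared identifiers.

    Symbols that are unknown in the source system, or whose identifier
    has no counterpart in the target system, are rendered as "?".
    """
    by_identifier = {identifier: symbol for symbol, identifier in target.items()}
    return [by_identifier.get(source.get(symbol), "?") for symbol in symbols]


print(translate(["S", "N", "x"], ASJP, BIPA))
```

Because every system is linked only to the identifiers, adding a further system requires one new mapping rather than one mapping per pair of systems.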

Cross-Linguistic Transcriptions

Objectives

  • in contrast to Phoible and other databases that list only the sound inventories of languages, CLTS is meant to serve as a standard for handling lexical data in the CLDF framework; the framework therefore needs to cover not only sound segments, but also ways to transcribe lexical data consistently

Cross-Linguistic Transcriptions

Strategy

  • register transcription systems by linking the sounds to phonetic feature bundles which serve as identifiers for sound segments
  • apply a three-step normalization procedure: (1) NFD normalization (Unicode decomposed characters), (2) normalization of Unicode confusables, and (3) replacement of dedicated alias symbols
  • divide sounds into different sound classes (vowel, consonant, diphthong, cluster, click, tone) to define specific rules for their respective feature sets
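
The three normalization steps can be sketched as follows. The NFD step uses Python's standard `unicodedata` module; the confusables and alias tables are tiny illustrative stand-ins with one entry each, and the function name `normalize` is an assumption rather than the CLTS API.

```python
import unicodedata

# Illustrative tables only; real tables would be much larger.
CONFUSABLES = {"\u003a": "\u02d0"}  # ASCII colon -> IPA length mark
ALIASES = {"\u02a6": "ts"}          # ts ligature -> t followed by s


def normalize(sound):
    """Apply the three normalization steps to a transcribed sound."""
    # Step 1: NFD splits precomposed characters into base + combining marks.
    sound = unicodedata.normalize("NFD", sound)
    # Step 2: replace look-alike characters by the intended ones.
    sound = "".join(CONFUSABLES.get(char, char) for char in sound)
    # Step 3: replace alias symbols by their preferred spelling.
    sound = "".join(ALIASES.get(char, char) for char in sound)
    return sound


for raw in ["\u00e3", "a:", "\u02a6"]:
    result = normalize(raw)
    print([f"U+{ord(char):04X}" for char in result])
```

Running NFD first means that the later lookup tables only need entries for decomposed forms.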

Cross-Linguistic Transcriptions

Strategy

  • allow for a quick expansion of the set of features and the sound segments for each alphabet by applying a procedure that tries to guess unknown sounds by decomposing them into base sounds and diacritics
  • use the feature bundles and the different transcription systems to link the transcription systems with various datasets, like Phoible, LingPy's sound class system, Wikipedia's sound descriptions, or the binary feature systems published along with PBase
  • features are not ambitious in the sense of being minimal, ordered, exclusive, binary, etc., but serve as a means of description, following the IPA as closely as possible
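
The guessing procedure for unknown sounds can be sketched under simplifying assumptions: the sound consists of one base character followed by any number of diacritics, and both lookup tables are small illustrative samples. The names `BASE`, `DIACRITICS`, and `guess` are hypothetical, and diacritics that precede the base are not handled.

```python
import unicodedata

# Illustrative samples: base sounds with their feature bundles, and
# diacritics with the single feature each of them contributes.
BASE = {
    "a": ("unrounded", "open", "front", "vowel"),
    "t": ("voiceless", "alveolar", "stop", "consonant"),
}
DIACRITICS = {
    "\u0303": "nasalized",  # combining tilde
    "\u02d0": "long",       # length mark
    "\u02b0": "aspirated",  # superscript h
}


def guess(sound):
    """Guess the identifier of a sound from its base and diacritics.

    Returns None if the base or any diacritic is unknown.
    """
    sound = unicodedata.normalize("NFD", sound)
    if not sound:
        return None
    base, marks = sound[0], sound[1:]
    if base not in BASE or any(mark not in DIACRITICS for mark in marks):
        return None
    # Diacritic features are prepended to the features of the base sound.
    return " ".join([DIACRITICS[mark] for mark in marks] + list(BASE[base]))


print(guess("\u00e3"))
print(guess("t\u02b0"))
print(guess("q"))
```

Sounds generated this way need not be listed explicitly for each alphabet, which is what makes quick expansion possible.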

Cross-Linguistic Transcriptions

Examples: Three-Step-Normalization

In                      NFD                     Confus.                 Alias                  Out
ã (U+00E3)              a (U+0061) ◌̃ (U+0303)   -                       -
a (U+0061) : (U+003a)   -                       a (U+0061) ː (U+02d0)   -
ʦ (U+02a6)              -                       -                       t (U+0074) s (U+0073)  ts

Cross-Linguistic Transcriptions

Examples: Three-Step-Normalization

In   Identifier
ã    nasalized unrounded open front vowel
aː   long unrounded open front vowel
ts   voiceless alveolar affricate consonant

Cross-Linguistic Transcriptions

Statistics: Now

  • two transcription systems: ASJP and B(road)IPA
  • three metadata sets: Phoible, LingPy, Wikipedia
  • Python API works on Python 3
  • web application allows users to check whether data conforms to BIPA

Cross-Linguistic Transcriptions

Statistics: Now

Dataset            Matched   Generated   Missed   Perc.
Phoible            613       616         772      61%
JIPA (142 lang.)   515       2           377      58%
PBase              496       265         521      59%
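
The slide does not define the percentage column. The figures are consistent with reading it as the share of sounds that were either matched or generated, which the following sketch recomputes from the counts in the table. The function name `coverage` and this reading of "Perc." are assumptions.

```python
# Counts from the table: (matched, generated, missed).
STATS = {
    "Phoible": (613, 616, 772),
    "JIPA": (515, 2, 377),
    "PBase": (496, 265, 521),
}


def coverage(matched, generated, missed):
    """Share of sounds that were matched or generated, in whole percent."""
    total = matched + generated + missed
    return round(100 * (matched + generated) / total)


COVERAGE = {name: coverage(*counts) for name, counts in STATS.items()}
for name, percent in COVERAGE.items():
    print(f"{name}: {percent}%")
```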

Cross-Linguistic Transcriptions

Statistics: Planned

  • four more transcription systems (UPA, NAPA, GLD-UTS, X-SAMPA)
  • more metadata (Index Diachronica, Ruhlen's Database, sound examples)
  • Python API for all platforms and all Python versions
  • enhanced web-application (select between transcription systems, translate, etc.)

Outlook


Why?

  • corpus-based sound inventories
  • increase comparability of lexical data
  • obtain new insights into sound change
  • evaluate consistency of linguistic data

Thanks for Your Attention!
