Data in quantitative comparative linguistics
Data in quantitative comparative linguistics
Agenda 2021
- Cross-Linguistic Data Formats
- CLDFBench
- Cross-Linguistic Transcription Systems
- Database of Cross-Linguistic Colexifications
- Retro-Standardizing the TPPSR
Cross-Linguistic Data Formats
Cross-Linguistic Data Formats
Idea
- linguists publish increasing amounts of digital data
- most data published are not comparable, as they are rarely standardized
- → standardizing already published data and encouraging scholars to standardize new data upon publication would increase the amount of comparable data available
Cross-Linguistic Data Formats
CLDF
- the CLDF initiative goes back to 2014 (founded at the DLCE/MPI-SHH in Jena)
- idea to use tabular data throughout
- employ the CSVW model by the W3C
- use Python packages for validation
- allow easy parsing / querying with Python, R, and SQL
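A minimal sketch of how such a dataset can be read, validated, and queried with pycldf; the metadata path is a hypothetical placeholder:

```python
# Minimal sketch: reading and validating a CLDF dataset with pycldf.
# The metadata path is a hypothetical placeholder.
from pycldf import Dataset

ds = Dataset.from_metadata("mydataset/cldf/Wordlist-metadata.json")
ds.validate()  # checks conformance with the CLDF specification

# Components are accessed by name; rows behave like dicts.
for form in ds["FormTable"]:
    print(form["Language_ID"], form["Parameter_ID"], form["Form"])
```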
Cross-Linguistic Data Formats
Propagation Efforts
- website and first version published in 2018 (https://cldf.clld.org) along with a paper (see details on the website)
- provide concrete examples of how to present data in this form (e.g., by publishing blog posts at https://calc.hypotheses.org, or by mentioning CLDF in papers and providing the data in CLDF, e.g., Sagart et al. 2019, PNAS, with lexical data at https://github.com/lexibank/sagartst)
- two main GitHub organizations: github.com/lexibank for lexical data and github.com/cldf-datasets for structural data
Cross-Linguistic Data Formats
Contents
- tabular data in CSV form, with metadata in JSON and sources in BibTeX, following the W3C recommendations for CSVW (tabular data on the web, 2015)
- pycldf package to validate the data, query it, and convert it to SQLite (as sketched below)
- integration of reference catalogs (Glottolog for languages, Concepticon for concepts, and CLTS for speech sounds, maybe Grammaticon for IGT in the future)
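Once converted to SQLite (pycldf ships a command line tool for this), the data can be queried with plain SQL. A sketch, with table and column names following pycldf's SQL schema, to be verified against the actual database:

```python
# Sketch: querying a CLDF dataset converted to SQLite.
import sqlite3

con = sqlite3.connect("mydataset.sqlite")  # hypothetical database file
for name, glottocode in con.execute(
    "SELECT cldf_name, cldf_glottocode FROM LanguageTable LIMIT 10"
):
    print(name, glottocode)
```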
CLDFBench
CLDFBench
Idea
- provide a Python package that does the heavy lifting of data conversion
- retro-standardization and conversion of data from other formats can be done with Python code that is testable, modular, and transparent (see the skeleton below)
- teach more and more people to work with CLDF
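A schematic sketch of what a CLDFBench dataset module looks like; the raw file name and column names are hypothetical placeholders:

```python
# Schematic CLDFBench dataset module (a sketch, not a full recipe).
import pathlib

from cldfbench import CLDFSpec
from cldfbench import Dataset as BaseDataset


class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "mydataset"  # hypothetical dataset identifier

    def cldf_specs(self):
        return CLDFSpec(dir=self.cldf_dir, module="StructureDataset")

    def cmd_makecldf(self, args):
        # The conversion logic lives in testable, transparent Python code.
        for row in self.raw_dir.read_csv("values.csv", dicts=True):
            args.writer.objects["ValueTable"].append({
                "ID": row["id"],
                "Language_ID": row["language"],
                "Parameter_ID": row["feature"],
                "Value": row["value"],
            })
```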
CLDFBench
Application
- CLDFBench now underlies some 70 lexical and about 30 structural datasets
- new CLDF datasets are all curated by means of CLDFBench instead of "pure" pycldf, as before
- examples of how to use the library are published in various forms (e.g., also as blog posts at https://calc.hypotheses.org)
CLDFBench
Example
- query WALS data with CLDF and CLDFBench
- blog post published today at https://calc.hypotheses.org
- explains how the WALS data can be converted into a single table to be used in other applications (illustrated below)
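A sketch of this kind of flattening with pycldf; the path assumes a hypothetical local clone of the WALS CLDF dataset:

```python
# Sketch: flattening WALS CLDF data into a single table.
from pycldf import Dataset

wals = Dataset.from_metadata("wals/cldf/StructureDataset-metadata.json")
languages = {row["ID"]: row["Name"] for row in wals["LanguageTable"]}
features = {row["ID"]: row["Name"] for row in wals["ParameterTable"]}

# One row per value, with language and feature names resolved.
table = [
    {
        "Language": languages[value["Language_ID"]],
        "Feature": features[value["Parameter_ID"]],
        "Value": value["Value"],
    }
    for value in wals["ValueTable"]
]
print(table[0])
```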
Cross-Linguistic Transcription Systems
Cross-Linguistic Transcription Systems
Idea
- we need a unified version of IPA
- but nobody wants to use a unified IPA
- so we create a meta-version: an underlying collection of speech sounds, defined by distinctive features, which can then be translated into various transcription systems
- on top of this, we add support for transcription datasets (similar to Concepticon) by explicitly linking transcription data (Phoible, etc.) to our catalog
- we provide Python support for various operations (comparing phoneme inventories for their similarity, etc.)
Cross-Linguistic Transcription Systems
Implementation
- pyclts (https://github.com/cldf-clts/pyclts) is the Python package (List et al. 2020) that provides access to these operations and is also used to check the data (see the sketch below)
- clts (https://github.com/cldf-clts/clts) is the data, which we curate with pycldf and release in cycles of ideally one release per year (List et al. 2021)
- https://clts.clld.org shows the data and allows scholars to inspect it
- https://digling.org/calc/clts/ is a reduced app that can be used to query the data for specific aspects
- the study by Anderson et al. (2018) presents the major idea and the algorithm (see the main CLLD website)
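A minimal sketch of basic pyclts usage, assuming a local copy of the clts data (the path is a placeholder):

```python
# Sketch: looking up a sound in the BIPA reference transcription system.
from pyclts import CLTS

clts = CLTS("path/to/clts")  # local clone of the clts data repository
bipa = clts.bipa

sound = bipa["kʰ"]
print(sound.name)  # feature-based name of the sound
print(sound.manner, sound.place)  # individual distinctive features
```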
Cross-Linguistic Transcription Systems
Integration in CLDF
- use Orthography Profiles (Moran and Cysouw 2018) to convert data to CLTS
- see the JavaScript version of the orthography profile tool for detailed ideas on orthography profiles: https://digling.org/calc/profile/ (written by myself); the Python counterpart is the segments package (https://github.com/cldf/segments; see the sketch below)
- orthography profiles are available from within the CLDFBench code for the conversion of lexical datasets (with support also via LingPy, etc.)
- main idea: do not segment your data on the fly; segment it once, do it well, and you can then use your data for many purposes afterwards!
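A minimal sketch of the segments package with a toy orthography profile (the profile entries are invented for illustration):

```python
# Sketch: segmenting and mapping data with a toy orthography profile.
from segments import Profile, Tokenizer

profile = Profile(
    {"Grapheme": "ch", "mapping": "tʃ"},
    {"Grapheme": "a", "mapping": "a"},
    {"Grapheme": "t", "mapping": "t"},
)
tokenizer = Tokenizer(profile=profile)

print(tokenizer("chat"))                    # segmented: "ch a t"
print(tokenizer("chat", column="mapping"))  # mapped to IPA: "tʃ a t"
```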
Cross-Linguistic Transcription Systems
Application
- we have a study on phoneme inventory systems currently under review
- we show that the correlations between phoneme inventories collected by different datasets (Phoible, UPSID, LAPSYD, etc.) with respect to certain aspects (inventory size, number of consonants, number of vowels) are disappointingly low (0.50-0.85; see the toy sketch below)
- we show that linking to CLTS makes a difference and reduces the discrepancies, but they remain huge
- we identify several problems (vowel length, consonant length, what counts as one phoneme) that lead to what at times seem to be arbitrary decisions
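A toy sketch of the kind of comparison involved, correlating inventory sizes reported by two datasets for the same languages; all numbers are invented for illustration:

```python
# Toy sketch: correlating inventory sizes across two datasets.
from statistics import correlation  # Pearson's r, Python 3.10+

sizes_dataset_a = [22, 35, 28, 41, 19]  # invented inventory sizes
sizes_dataset_b = [25, 31, 30, 44, 24]  # same languages, other dataset

print(round(correlation(sizes_dataset_a, sizes_dataset_b), 2))
```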
Database of Cross-Linguistic Colexifications
Database of Cross-Linguistic Colexifications
Idea
- assemble lexical datasets coded in CLDF to extract colexification patterns automatically
- use this to replace the hard-to-maintain CLICS 1.0 database (List et al. 2014)
- restrict the curation of the data to the selection of a couple of base datasets (currently 30; Rzymski et al. 2020; see the CLICS homepage at https://clics.clld.org)
- add minimal Python code to infer colexifications from the data (a sketch follows below)
- represent the data as a CLLD app
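A minimal sketch of the inference step: two concepts colexify in a language if they are expressed by the same form (the metadata path is a placeholder):

```python
# Sketch: inferring colexifications from a CLDF wordlist.
from collections import defaultdict
from itertools import combinations

from pycldf import Dataset

ds = Dataset.from_metadata("mydataset/cldf/Wordlist-metadata.json")

# Group the concepts expressed by each (language, form) pair.
concepts = defaultdict(set)
for row in ds["FormTable"]:
    concepts[row["Language_ID"], row["Form"]].add(row["Parameter_ID"])

# Every pair of concepts expressed by one form is a colexification.
colexifications = defaultdict(set)
for (language, _), concept_set in concepts.items():
    for pair in combinations(sorted(concept_set), 2):
        colexifications[pair].add(language)
```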
Database of Cross-Linguistic Colexifications
Application
- Jackson et al. (Science, 2019) show that emotion semantics vary across language families, using colexification networks and emotion concept communities from the CLICS 3 database
- many more applications are possible, and people are becoming increasingly aware of CLICS
Database of Cross-Linguistic Colexifications
Future Ideas
- expand the database by adding at least 10 more datasets for CLICS 4.0
- add specific queries for WALS-like lexical features (arm/hand, color distinctions, etc.)
- allow these features to be computed automatically, and then check to what degree they correlate with the WALS data and other datasets ("automatic feature extraction"; see the sketch below)
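A toy sketch of such a feature query, building on a colexification mapping like the one inferred above; the data are invented:

```python
# Toy sketch: a WALS-like binary feature derived from colexifications.
# The mapping below is invented; in practice it would be inferred
# from the CLICS data as sketched earlier.
colexifications = {("ARM", "HAND"): {"russian", "czech"}}

def colexifies(language, concept_a, concept_b):
    """True if the language expresses both concepts with one form."""
    pair = tuple(sorted([concept_a, concept_b]))
    return language in colexifications.get(pair, set())

print(colexifies("russian", "ARM", "HAND"))  # True
print(colexifies("english", "ARM", "HAND"))  # False
```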
Retro-Standardizing the TPPSR
Retro-Standardizing the TPPSR
Idea
- the TPPSR is a dialect atlas of the patois of Suisse romande (French-speaking Switzerland), collected in the early 20th century
- the dialects have now disappeared, and the data were never published as maps, only as tables
- standardizing the data allows us both to publish the maps and to use the data for qualitative as well as quantitative studies
Retro-Standardizing the TPPSR
Practice
- use CLDFBench, link concepts to Concepticon, language varieties to Glottolog, and sounds to CLTS
- test the CLDF dataset rigorously (see the sketch below)
- make a CLLD app at https://tppsr.clld.org
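A sketch of such testing with pycldf; the path is a placeholder, and the column names assume the usual lexibank-style layout:

```python
# Sketch: rigorous checks on a retro-standardized CLDF dataset.
from pycldf import Dataset

ds = Dataset.from_metadata("cldf/Wordlist-metadata.json")
assert ds.validate()  # conformance with the CLDF specification

# Every variety should be linked to Glottolog, every concept
# to Concepticon (column names assume a lexibank-style dataset).
for language in ds["LanguageTable"]:
    assert language["Glottocode"], f"no Glottolog link: {language['ID']}"
for concept in ds["ParameterTable"]:
    assert concept["Concepticon_ID"], f"no Concepticon link: {concept['ID']}"
```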
Retro-Standardizing the TPPSR
Summary
- although the work was tedious, due to the very complex transcription system, the CLLD app allows us to fulfil the dream of the TPPSR editors (Gauchat et al. 1925): to put their data into geographic space
- retro-standardization has a lot of potential for many projects
- the good thing is that we do not touch any original data with CLDF/CLDFBench; we only provide the code to represent the data in another format, and we add a substantial amount of external information in a systematic manner
Thanks for Your Attention!