Data in quantitative comparative linguistics
Data in quantitative comparative linguistics
Agenda 2021
- Cross-Linguistic Data Formats
- CLDFBench
- Cross-Linguistic Transcription Systems
- Database of Cross-Linguistic Colexifications
- Retro-Standardizing the TPPSR
Cross-Linguistic Data Formats
Cross-Linguistic Data Formats
Idea
- linguists publish increasing amounts of digital data
- most data published are not comparable, as they are rarely standardized
- → standardizing already published data and encouraging scholars to standardize new data upon publication would increase the amount of comparable data available
Cross-Linguistic Data Formats
CLDF
- the CLDF initiative goes back to 2014 (founded at the DLCE/MPI-SHH in Jena)
- idea to use tabular data throughout
- employ the CSVW model by the W3C
- use Python packages for validation
- allow easy parsing / querying with Python, R, and SQL
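A minimal sketch of how such a dataset can be read, validated, and queried with pycldf; the metadata path is a hypothetical placeholder:

```python
# Minimal sketch: reading and validating a CLDF dataset with pycldf.
# The metadata path is a hypothetical placeholder.
from pycldf import Dataset

ds = Dataset.from_metadata("mydataset/cldf/Wordlist-metadata.json")
ds.validate()  # checks conformance with the CLDF specification

# Components are accessed by name; rows behave like dicts.
for form in ds["FormTable"]:
    print(form["Language_ID"], form["Parameter_ID"], form["Form"])
```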
Cross-Linguistic Data Formats
Propagation Efforts
- website and first version published in 2018 (https://cldf.clld.org) along with a paper (see details on the website)
- provide concrete examples of how to present data in this form (e.g., by publishing blog posts at https://calc.hypotheses.org, or by mentioning CLDF in papers and providing the data in CLDF, e.g., Sagart et al. 2019, PNAS, with lexical data at https://github.com/lexibank/sagartst)
- two main GitHub organizations: github.com/lexibank for lexical data and github.com/cldf-datasets for structural data
Cross-Linguistic Data Formats
Contents
- tabular data in CSV form, with metadata in JSON and sources in BibTeX, following the W3C recommendations for CSVW (tabular data on the web, 2015)
- pycldf package to validate the data, query it, and convert it to SQLite (as sketched below)
- integration of reference catalogs (Glottolog for languages, Concepticon for concepts, and CLTS for speech sounds, maybe Grammaticon for IGT in the future)
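Once converted to SQLite (pycldf ships a command line tool for this), the data can be queried with plain SQL. A sketch, with table and column names following pycldf's SQL schema, to be verified against the actual database:

```python
# Sketch: querying a CLDF dataset converted to SQLite.
import sqlite3

con = sqlite3.connect("mydataset.sqlite")  # hypothetical database file
for name, glottocode in con.execute(
    "SELECT cldf_name, cldf_glottocode FROM LanguageTable LIMIT 10"
):
    print(name, glottocode)
```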
CLDFBench
CLDFBench
Idea
- provide a Python package that does the heavy lifting of data conversion
- retro-standardization and conversion of data from other formats can be done with Python code that is testable, modular, and transparent (see the skeleton below)
- teach more and more people to work with CLDF
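A schematic sketch of what a CLDFBench dataset module looks like; the raw file name and column names are hypothetical placeholders:

```python
# Schematic CLDFBench dataset module (a sketch, not a full recipe).
import pathlib

from cldfbench import CLDFSpec
from cldfbench import Dataset as BaseDataset


class Dataset(BaseDataset):
    dir = pathlib.Path(__file__).parent
    id = "mydataset"  # hypothetical dataset identifier

    def cldf_specs(self):
        return CLDFSpec(dir=self.cldf_dir, module="StructureDataset")

    def cmd_makecldf(self, args):
        # The conversion logic lives in testable, transparent Python code.
        for row in self.raw_dir.read_csv("values.csv", dicts=True):
            args.writer.objects["ValueTable"].append({
                "ID": row["id"],
                "Language_ID": row["language"],
                "Parameter_ID": row["feature"],
                "Value": row["value"],
            })
```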
CLDFBench
Application
- CLDFBench now underlies some 70 lexical and about 30 structural datasets
- new CLDF datasets are all curated by means of CLDFBench instead of "pure" pycldf, as before
- examples of how to use the library are published in various forms (e.g., also as blog posts at https://calc.hypotheses.org)
CLDFBench
Example
- query WALS data with CLDF and CLDFBench
- blog post published today at https://calc.hypotheses.org
- explains how the WALS data can be converted into a single table to be used in other applications (illustrated below)
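A sketch of this kind of flattening with pycldf; the path assumes a hypothetical local clone of the WALS CLDF dataset:

```python
# Sketch: flattening WALS CLDF data into a single table.
from pycldf import Dataset

wals = Dataset.from_metadata("wals/cldf/StructureDataset-metadata.json")
languages = {row["ID"]: row["Name"] for row in wals["LanguageTable"]}
features = {row["ID"]: row["Name"] for row in wals["ParameterTable"]}

# One row per value, with language and feature names resolved.
table = [
    {
        "Language": languages[value["Language_ID"]],
        "Feature": features[value["Parameter_ID"]],
        "Value": value["Value"],
    }
    for value in wals["ValueTable"]
]
print(table[0])
```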
Cross-Linguistic Transcription Systems
Cross-Linguistic Transcription Systems
Idea
- we need a unified version of IPA
- but nobody wants to use a unified IPA
- so we create a meta-version: an underlying collection of speech sounds, defined by distinctive features, which can then be translated into various transcription systems
- on top of this, we add support for transcription datasets (similar to Concepticon) by explicitly linking transcription data (Phoible, etc.) to our catalog
- we provide Python support for various operations (comparing phoneme inventories for their similarity, etc.)
Cross-Linguistic Transcription Systems
Implementation
- pyclts (https://github.com/cldf-clts/pyclts) is the Python package (List et al. 2020) that provides access to these operations and is also used to check the data (see the sketch below)
- clts (https://github.com/cldf-clts/clts) is the data, which we curate with pycldf and release in cycles of ideally one release per year (List et al. 2021)
- https://clts.clld.org shows the data and allows scholars to inspect it
- https://digling.org/calc/clts/ is a reduced app that can be used to query the data for specific aspects
- the study by Anderson et al. (2018) presents the major idea and the algorithm (see the main CLLD website)
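A minimal sketch of basic pyclts usage, assuming a local copy of the clts data (the path is a placeholder):

```python
# Sketch: looking up a sound in the BIPA reference transcription system.
from pyclts import CLTS

clts = CLTS("path/to/clts")  # local clone of the clts data repository
bipa = clts.bipa

sound = bipa["kʰ"]
print(sound.name)  # feature-based name of the sound
print(sound.manner, sound.place)  # individual distinctive features
```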
Cross-Linguistic Transcription Systems
Integration in CLDF
- use Orthography Profiles (Moran and Cysouw 2018) to convert data to CLTS
- see the JavaScript version of the orthography profile tool for detailed ideas on orthography profiles: https://digling.org/calc/profile/ (written by myself); the Python counterpart is the segments package (https://github.com/cldf/segments; see the sketch below)
- orthography profiles are available from within the CLDFBench code for the conversion of lexical datasets (with support also via LingPy, etc.)
- main idea: do not segment your data on the fly; segment it once, do it well, and you can then use your data for many purposes afterwards!
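A minimal sketch of the segments package with a toy orthography profile (the profile entries are invented for illustration):

```python
# Sketch: segmenting and mapping data with a toy orthography profile.
from segments import Profile, Tokenizer

profile = Profile(
    {"Grapheme": "ch", "mapping": "tʃ"},
    {"Grapheme": "a", "mapping": "a"},
    {"Grapheme": "t", "mapping": "t"},
)
tokenizer = Tokenizer(profile=profile)

print(tokenizer("chat"))                    # segmented: "ch a t"
print(tokenizer("chat", column="mapping"))  # mapped to IPA: "tʃ a t"
```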
Cross-Linguistic Transcription Systems
Application
- we have a study on phoneme inventory systems currently under review
- we show that the correlations between phoneme inventories collected by different datasets (Phoible, UPSID, LAPSYD, etc.) with respect to certain aspects (inventory size, number of consonants, number of vowels) are disappointingly low (0.50-0.85; see the toy sketch below)
- we show that linking to CLTS makes a difference and reduces the discrepancies, but they remain huge
- we identify several problems (vowel length, consonant length, what counts as one phoneme) that lead to what at times seem to be arbitrary decisions
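A toy sketch of the kind of comparison involved, correlating inventory sizes reported by two datasets for the same languages; all numbers are invented for illustration:

```python
# Toy sketch: correlating inventory sizes across two datasets.
from statistics import correlation  # Pearson's r, Python 3.10+

sizes_dataset_a = [22, 35, 28, 41, 19]  # invented inventory sizes
sizes_dataset_b = [25, 31, 30, 44, 24]  # same languages, other dataset

print(round(correlation(sizes_dataset_a, sizes_dataset_b), 2))
```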
Database of Cross-Linguistic Colexifications
Database of Cross-Linguistic Colexifications
Idea
- assemble lexical datasets coded in CLDF to extract colexification patterns automatically
- use this to replace the hard-to-maintain CLICS 1.0 database (List et al. 2014)
- restrict the curation of the data to the selection of a couple of base datasets (currently 30; Rzymski et al. 2020; see the CLICS homepage at https://clics.clld.org)
- add minimal Python code to infer colexifications from the data (a sketch follows below)
- represent the data as a CLLD app
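A minimal sketch of the inference step: two concepts colexify in a language if they are expressed by the same form (the metadata path is a placeholder):

```python
# Sketch: inferring colexifications from a CLDF wordlist.
from collections import defaultdict
from itertools import combinations

from pycldf import Dataset

ds = Dataset.from_metadata("mydataset/cldf/Wordlist-metadata.json")

# Group the concepts expressed by each (language, form) pair.
concepts = defaultdict(set)
for row in ds["FormTable"]:
    concepts[row["Language_ID"], row["Form"]].add(row["Parameter_ID"])

# Every pair of concepts expressed by one form is a colexification.
colexifications = defaultdict(set)
for (language, _), concept_set in concepts.items():
    for pair in combinations(sorted(concept_set), 2):
        colexifications[pair].add(language)
```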
Database of Cross-Linguistic Colexifications
Application
- Jackson et al. (Science, 2019) show that emotion semantics vary across language families, using colexification networks and emotion concept communities from the CLICS 3 database
- many more applications are possible, and people are becoming increasingly aware of CLICS
Database of Cross-Linguistic Colexifications
Future Ideas
- expand the database by adding at least 10 more datasets for CLICS 4.0
- add specific queries for WALS-like lexical features (arm/hand, color distinctions, etc.)
- allow these features to be computed automatically, and then check to what degree they correlate with the WALS data and other datasets ("automatic feature extraction"; see the sketch below)
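A toy sketch of such a feature query, building on a colexification mapping like the one inferred above; the data are invented:

```python
# Toy sketch: a WALS-like binary feature derived from colexifications.
# The mapping below is invented; in practice it would be inferred
# from the CLICS data as sketched earlier.
colexifications = {("ARM", "HAND"): {"russian", "czech"}}

def colexifies(language, concept_a, concept_b):
    """True if the language expresses both concepts with one form."""
    pair = tuple(sorted([concept_a, concept_b]))
    return language in colexifications.get(pair, set())

print(colexifies("russian", "ARM", "HAND"))  # True
print(colexifies("english", "ARM", "HAND"))  # False
```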
Retro-Standardizing the TPPSR
Retro-Standardizing the TPPSR
Idea
- the TPPSR is a dialect atlas of the patois of Suisse romande (French-speaking Switzerland), collected in the early 20th century
- the dialects have now disappeared, and the data were never published as maps, only as tables
- standardizing the data allows us both to publish the maps and to use the data for qualitative as well as quantitative studies
Retro-Standardizing the TPPSR
Practice
- use CLDFBench, link concepts to Concepticon, language varieties to Glottolog, and sounds to CLTS
- test the CLDF dataset rigorously (see the sketch below)
- make a CLLD app at https://tppsr.clld.org
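A sketch of such testing with pycldf; the path is a placeholder, and the column names assume the usual lexibank-style layout:

```python
# Sketch: rigorous checks on a retro-standardized CLDF dataset.
from pycldf import Dataset

ds = Dataset.from_metadata("cldf/Wordlist-metadata.json")
assert ds.validate()  # conformance with the CLDF specification

# Every variety should be linked to Glottolog, every concept
# to Concepticon (column names assume a lexibank-style dataset).
for language in ds["LanguageTable"]:
    assert language["Glottocode"], f"no Glottolog link: {language['ID']}"
for concept in ds["ParameterTable"]:
    assert concept["Concepticon_ID"], f"no Concepticon link: {concept['ID']}"
```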
Retro-Standardizing the TPPSR
Summary
- although the work was tedious, due to the very complex transcription system, the CLLD app allows us to fulfil the dream of the TPPSR editors (Gauchat et al. 1925): to put their data into geographic space
- retro-standardization has a lot of potential for many projects
- the good thing is that we do not touch any original data with CLDF/CLDFBench; we only provide the code to represent the data in another format, and we add a substantial amount of external information in a systematic manner
Thanks for Your Attention!