Chances and Challenges...


... for Computational Language Comparison

image

Agenda

  • Language Opens Windows
  • Computational Language Comparison
  • Chances
  • Challenges
  • Outlook

Language Opens Windows

img

Language Comparison Open Windows

To Our Past

Comparative linguistics has provided us with many new insights into the past of our languages.

Language Comparison Opens Windows

To Our Cognition

Comparative linguistics has a great and so far underexplored potential to help us learn more about human cognition.

Language Comparison Opens Windows

To Our Culture

Comparative linguistics can help us to distinguish culturally specific traits from universal tendencies.

Language Comparison Opens Windows

But Comparing Languages has Limits

We face many problems, resulting largely from the lack of

  • Standards
  • Agreed-Upon Methods
  • Comparable Data

Language Comparison Opens Windows

Lack of Standards

Linguists rarely follow standards in naming languages, referencing concepts, or transcribing words.

Language Comparison Opens Windows

Lack of Agreed-Upon Methods

Linguists use a large amount of different methods and barely agree even on the basic procedures for inference, as specifically witnessed by the "comparative method", which differs from scholar to scholar and has never been properly formalized.

Language Comparison Opens Windows

Lack of Comparable Data

Linguistic data for different or identical languages are largely incomparable, since key aspects of the data have often not been unified, as reflected in idiosyncratic elicitation glosses, language names, or transcription systems.

Computer-Assisted Language Comparison

img

Computational Language Comparison

Background

The Quantitative Turn in Historical Linguistics

Computational Language Comparison

Background

Expertendämmerung

Computational Language Comparison

Background

Computational Language Comparison

Computer-Assisted Language Comparison

Computer-Assisted Language Comparison (List 2016)

Computational Language Comparison

Computer-Assisted Language Comparison

Computer-Assisted Language Comparison (List 2016)

Computational Language Comparison

Computer-Assisted Language Comparison

Core ideas of the CALC framework

  • Data must be human- and machine-readable
  • Software is used to preprocess linguistic data and should specifically target linguistic problems rather than build on naive off-the-shelf solutions in machine learning
  • Interfaces help linguists to access the data and to post-process and correct machine output

Chances

img

Chances

Standardization

Retro-Standardizing Data through Data-Lifting

Chances

Standardization

(Retro)-standardization (or data lifting) is one of the core aspects of our recent efforts, expressed as part of the Cross-Linguistic Data Formats initiative (https://cldf.clld.org). This means we

  • establish and curate reference catalogs (large collections of small-scale constructs for linguistic research objects, including languages, concepts, and sounds),
  • parse digitized data semi-automatically in order to link data points to our reference catalogs,
  • use test-driven data curation to guarantee the workflow passes our tests.

Chances

Standardization

Cross-Linguistic Data Formats (Forkel et al. 2018, https://cldf.clld.org)

Chances

Standardization

Reference Catalogs

Reference catalogue for language varieties (languages and dialects), providing language identifiers, geolocations, classifications, and references (Hammarström et al. 2020, https://glottolog.org).

Chances

Standardization

Reference Catalogs

Reference catalog for concepts, which are defined independently of concrete languages, providing concept identifiers, concept metadata, concept relations, and references (List et al. 2020, https://concepticon.clld.org).

Chances

Standardization

Reference Catalogs

Reference catalogue for speech sounds (across different transcription systems and datasets), offering sound identifiers, feature-based sound descriptions, and references (List et al. 2019, https://clts.clld.org).

Chances

Annotation

Annotation helps us to add information to a dataset which was not inherently given by the structure of the data before. For these tasks, we use

  • targeted algorithms to pre-process the data, along with
  • light-weight, interative annotation tools that help us to manually correct automatically preprocessed data.

Prime examples for annotation are: morpheme segmentation (assignment of morpheme boundaries), cognate annotation (assignment of etymological relationships).

Chances

Annotation

Chances

Improved Methods

Cognates and Sound Correspondences

  • Cognate Detection: Being considered a task impossible for machines to be carried out, we are now able to detect cognates at a high level of accuracy (List et al. 2017), and even for word parts, as opposed to full words (List et al. 2016).
  • Sound Correspondence Pattern Detection: Improved methods for cognate detection and phonetic alignments have ultimately allowed us to work on first approaches to handling sound correspondence patterns across multiple languages in an unprecedented scale (List 2019).

Chances

Improved Methods

Cognates and Sound Correspondences

List et al. (2016): Partial cognate detection workflow.

Chances

Improved Methods

Cognates and Sound Correspondences

List (2019): Correspondence pattern detection.

Chances

Improved Methods

Colexification Networks

Chances

Improved Methods

Colexification Networks

List et al (2013): Using community detection methods to identify colexification networks.

Chances

Improved Methods

Colexification Networks

Chances

Improved Methods

Colexification Networks

Chances

Improved Methods

Colexification Networks

Chances

Improved Methods

Colexification Networks

Chances

Improved Methods

Colexification Networks

Chances

Improved Methods

Colexification Networks

Challenges

img

Challenges

Scientific Practice

Challenges

Scientific Practice

Too many studes are still being published without submitting data and code, increasing the amount of irreproducible data in the field of comparative linguistics.

Challenges

Modeling

Jacques and List (2019): Incomplete lineage sorting.

Challenges

Modeling

Our models are still naive, far too naive, and we need to invest much more time into a careful modeling of our problems, rather than trusting that increased computation power would help us in solving our issues.

Challenges

Data Integration

  • We still collect data for various purposes in different datasets and separate what belongs together.
  • We need to improve our frameworks for integrated data collection, where grammars and dictionaries are no longer separated from a language corpus.

Outlook

img

Outlook

  • Computational approaches to the field of historical and typological language comparison have improved a lot.
  • However, we still have a lot to do in order to expand the boundaries of our discipline.
  • Core strategies that will help our field to gain more acceptance among scientists in general involve computer-assisted approaches, targeted data-lifting through retro-standardization, and the development of new, targeted methods.

Спасибо за ваше внимание!

image