Problems in Assessing the Probability of Language Relatedness

Agenda 2020

Introduction
Initial Thoughts on Tests and Proofs
How to Design Tests for Relatedness?
Reviewing Tests in the Literature
Conclusion

Introduction

Monophyly

La Société n'admet aucune communication concernant, soit l'origine du langage, soit la création d'une langue universelle. [The Society does not admit any communication concerning either the origin of language or the creation of a universal language] (Statutes of the Société de Linguistique de Paris, 1866)

The Burden of Proof

Since linguists do not take a monophyletic origin of language for granted, they cannot, unlike their colleagues in biology, simply start comparing all languages freely.
Before they do so, they must search for sufficient proof that the languages they want to investigate comparatively do indeed share a common ancestor.

Where is the Proof?

The biggest problem for the proof of genetic relationship is that scholars themselves hold quite different opinions on which shared traits among languages count as evidence for relatedness.
This disagreement is often framed as a debate between grammar as proof on the one hand and lexicon as proof on the other, although this reduction is in fact highly misleading, since grammar also refers to the form of the linguistic sign.

Passing Tests

In order to circumvent this problem, scholars have repeatedly tried to produce statistical tests that would help them to prove language relationship in a more objective manner.
The basic idea of most of these tests is to show that it is highly unlikely that a certain similarity pattern, as it can be observed between two or more languages, has evolved by chance.
Starting with the work of Ringe (1992), there have been quite a few attempts to design an ultimate test for genetic relatedness.

Initial Thoughts on Tests and Proofs

Absolute Proofs and Relative Tests

Classical historical linguistic scholarship treats genetic relationship in a manner similar to mathematicians who use proofs to mark a problem as solved. If the proof for genetic relationship has been identified (which may require a lot of genius), the problem is settled, and the reconstruction can begin.

The idea of designing a test, however, is fundamentally different in this regard, since tests only offer approximations to problems, and they are always accompanied by rates of false positives and false negatives.
Designing a test to prove something is therefore problematic from the outset, since testing and proving pursue fundamentally different goals.
This conflict is also one of the reasons for the numerous discussions between those who favor proofs and those who favor tests when it comes to dealing with genetic relationship.

Test Theory and Construct Validity

Constructs in the social sciences:

A construct refers to something that is not directly observable, but that may leave traces in tests.
While psychologists tend to believe that the concept of intelligence is real, they agree that they cannot really investigate it, and what they investigate instead is their construct of intelligence.
Constructs can be defined in different ways, and some constructs of intelligence will perform better than others when put to the test.

Construct thinking can help to channel discussions about the "nature of the proto-language":

We can treat the proto-language which we infer as a construct, knowing that this construct is not identical to the proto-language as it was once spoken.
Our linguistic reconstruction thus addresses our construct of the proto-language, and this construct is the "fiction or story put forward by a theorist to make sense of a phenomenon" (Statt 1998[1980]: 67).

Control criteria for scientific testing:

construct validity: describes "[...] how well each item of a [...] test measures or predicts what it’s supposed to measure or predict" (Statt 1998: 30).
reliability: describes whether "a particular observation has yielded a replicable score" (Liebert and Liebert 1995).
objectivity: "kennzeichnet die Unabhängigkeit seines Ergebnisses von der Person, die den Test durchführt" ["refers to the independence of the result from the person conducting the test"] (Maderthaner 2008: 89).

Given that scholars often treat the proof of genetic relatedness as a specific test, the general control criteria developed in the test theory of the social sciences should also apply to it.
As a result, we can use the notions of constructs and construct validity to assess the quality of proposed tests.
Furthermore, when working on tests for genetic relationship in the future, it will be helpful to keep these basic criteria in mind.

How to Design Tests for Relatedness?

Objectivity

In historical linguistics, we look back on a long history of celebrated individual scholarship that used intuition to solve the greatest linguistic riddles.
In this respect, we share a tradition with mathematics, including the idea that proofs are detected by individuals rather than inferred by an automated test.
But tests should work without geniuses applying them, and they cannot rely on the intuition of a person.

How to guarantee objectivity:

We can automate our test procedure to such a degree that we no longer need any person conducting the test (formalize the approach).
We can ask several colleagues to conduct the same test and see to what degree they converge in their assessments (check inter-annotator agreement).
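
One common way to quantify inter-annotator agreement is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch only; the function name `cohens_kappa` and the toy verdicts are illustrative assumptions and are not taken from any of the studies discussed here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who judged the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # proportion of items on which both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # agreement expected by chance, given each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts: does each of 8 language pairs pass the test?
annotator_1 = [True, True, False, False, True, False, True, False]
annotator_2 = [True, False, False, False, True, False, True, True]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

A value close to 1 indicates that the outcome hardly depends on who conducts the test, while a value close to 0 indicates agreement at chance level.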

Reliability

Since many circumstances may differ when a test is re-applied, it is not easy to give a clear-cut account of how the reliability of a given test can be assessed completely.

Tests should be applicable to a coherent set of languages.
Tests should state clearly up to which time depth they are expected to yield reliable results.
Tests should be accompanied by a complete evaluation of their usefulness on cross-linguistic gold standards.
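
An evaluation on a gold standard could, for example, report the rates of false positives and false negatives. The sketch below assumes a gold standard that simply labels language pairs as related or unrelated; the function name `evaluate_on_gold_standard`, the placeholder language names `L1` to `L4`, and all labels are hypothetical.

```python
def evaluate_on_gold_standard(gold, verdicts):
    """Compare test verdicts with gold-standard labels (True means 'related')."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for pair, related in gold.items():
        verdict = verdicts[pair]
        if related and verdict:
            counts["tp"] += 1   # related, and the test says so
        elif related and not verdict:
            counts["fn"] += 1   # related, but the test misses it
        elif not related and verdict:
            counts["fp"] += 1   # unrelated, but the test claims relatedness
        else:
            counts["tn"] += 1   # unrelated, and the test says so
    unrelated_total = counts["fp"] + counts["tn"]
    related_total = counts["tp"] + counts["fn"]
    return {
        "counts": counts,
        "false_positive_rate": counts["fp"] / unrelated_total if unrelated_total else None,
        "false_negative_rate": counts["fn"] / related_total if related_total else None,
    }


# Hypothetical gold standard and hypothetical test verdicts
gold = {
    ("L1", "L2"): True, ("L1", "L3"): True, ("L2", "L3"): True,
    ("L1", "L4"): False, ("L2", "L4"): False, ("L3", "L4"): False,
}
verdicts = {
    ("L1", "L2"): True, ("L1", "L3"): False, ("L2", "L3"): True,
    ("L1", "L4"): False, ("L2", "L4"): True, ("L3", "L4"): False,
}
report = evaluate_on_gold_standard(gold, verdicts)
print(report)
```

Running the same evaluation separately for groups of language pairs at different time depths would show where a test stops yielding reliable results.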

Validity

Validity is hard to guarantee, because it is quite difficult to find clear-cut, formal criteria that would tell us how to check the validity of a given test: our target is a construct, not an entity which we can directly observe.
But keeping in mind that validity means that the test measures what it is supposed to measure, it is important to reflect on the major problems we have in identifying genetically related traits.

Different kinds of lexical similarity (List 2014)

Thus, if we design a test for genetic relatedness, we need to make sure that we do not measure any other kind of relatedness.
It is obvious that this is very difficult, and scholars have tried to circumvent the problem in many ways, be it by comparing traits that are difficult to borrow, or by comparing language pairs where language contact can be ideally excluded.

Reviewing Tests in the Literature

Preliminaries

What is a good test?

[1] the test should either be fully automated or otherwise it should come along with extensive tests on inter-annotator agreement
[2] the test should be applied to a large gold standard of different languages from different time depths
[3] the test should be able to cope with contact-induced and natural similarities (e.g., sound symbolism)

Correspondence-Based Approaches

These approaches were inspired by Ringe (1992), who infers correspondence statistics from semantically aligned wordlists of 200 concepts (based on Swadesh 1955). The approach was later criticized by Baxter (1996) and then further developed by Kessler (2001).

Workflow:

Compile a wordlist for all languages by translating the items of the concept list into the target languages, restricting translations to only one word.
Make a matrix of n sounds occurring as initials in language A and m sounds occurring as initials in language B, and fill in how often each sound pair co-occurs in words for the same concept.
Evaluate the results statistically.
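
The three steps above can be sketched as follows. This is an illustration under stated assumptions and not a reimplementation of the procedures by Ringe (1992) or Kessler (2001): each word is written as a string in which one character stands for one sound, the word forms are invented, the association in the matrix is summarized with a chi-squared statistic, and significance is estimated by shuffling one wordlist. All function names are hypothetical.

```python
import random
from collections import Counter


def initial_matrix(words_a, words_b):
    """Count how often each pair of initial sounds co-occurs for the same concept."""
    return Counter((a[0], b[0]) for a, b in zip(words_a, words_b))


def chi_squared(matrix):
    """Chi-squared statistic of the initial-sound matrix."""
    total = sum(matrix.values())
    rows, cols = Counter(), Counter()
    for (a, b), count in matrix.items():
        rows[a] += count
        cols[b] += count
    stat = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / total
            stat += (matrix.get((a, b), 0) - expected) ** 2 / expected
    return stat


def permutation_p_value(words_a, words_b, runs=1000, seed=42):
    """Estimate how often shuffled wordlists reach the observed statistic."""
    rng = random.Random(seed)
    observed = chi_squared(initial_matrix(words_a, words_b))
    shuffled = list(words_b)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # destroys the semantic alignment
        if chi_squared(initial_matrix(words_a, shuffled)) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (runs + 1)


# Invented word forms, aligned by concept (position i = same concept)
words_a = ["pata", "piru", "tano", "tiku", "kasu", "kimo", "pona", "tela"]
words_b = ["fata", "firu", "sano", "siku", "hasu", "himo", "fona", "sela"]
observed_stat = chi_squared(initial_matrix(words_a, words_b))
p_value = permutation_p_value(words_a, words_b)
print(observed_stat, p_value)
```

The toy data are constructed so that initials correspond perfectly (p:f, t:s, k:h); real wordlists are far noisier, and a low p-value alone does not distinguish inheritance from borrowing or sound symbolism.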

Problems:

The procedure by which the wordlists are compiled is not objective (no inter-annotator agreement was tested, see also Geisler and List 2010).
The procedure lacks an assessment of its reliability, as it was only tested on a few languages.
Validity is unclear, as there is no explicit control for borrowings or sound symbolisms.
The model of relatedness does not allow for semantic change.
Restricting matches to initial sounds deprives the test of its power.

Excursus: Wordlist Compilation

Uncertainty resulting from arbitrary synonym choices (List 2018).

Sound-Class-Based Approaches

Ultimately going back to Dolgopolsky (1964), sound classes have been employed by many scholars as a way to handle extreme phonetic and phonological variation and to allow for an exploratory analysis of language data in search of potential cognates. Starting with Baxter and Manaster Ramer (2000), Dolgopolsky's approach was supplemented by statistical tests (usually based on permutation), and it was then applied to more languages by Turchin et al. (2010) and to proto-languages by Kassian et al. (2015).

Workflow:

Compile a wordlist and extract consonant classes for the target languages.
Count the number of matching consonant classes for each language pair.
Assess the assumed significance of the matches by applying permutation tests.
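
A minimal sketch of this workflow is given below. It rests on several assumptions: the sound-class mapping is a small simplified table rather than the full inventory of Dolgopolsky (1964), two words count as a match when their first two consonant classes are identical, each character stands for one sound, and the word forms are invented. The names `SOUND_CLASSES`, `consonant_classes`, `count_matches`, and `permutation_test` are hypothetical.

```python
import random

# Simplified, illustrative mapping from sounds to consonant classes (an assumption)
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K", "x": "K",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
    "w": "W", "j": "J", "h": "H",
}


def consonant_classes(word, length=2):
    """Return the classes of the first `length` consonants; vowels are skipped."""
    classes = [SOUND_CLASSES[sound] for sound in word if sound in SOUND_CLASSES]
    return tuple(classes[:length])


def count_matches(words_a, words_b, length=2):
    """Count concept slots in which both words show identical consonant classes."""
    matches = 0
    for a, b in zip(words_a, words_b):
        classes_a = consonant_classes(a, length)
        if len(classes_a) == length and classes_a == consonant_classes(b, length):
            matches += 1
    return matches


def permutation_test(words_a, words_b, runs=1000, seed=1):
    """Share of shuffled wordlists with at least as many matches as observed."""
    rng = random.Random(seed)
    observed = count_matches(words_a, words_b)
    shuffled = list(words_b)
    hits = 0
    for _ in range(runs):
        rng.shuffle(shuffled)  # breaks the alignment by concept
        if count_matches(words_a, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (runs + 1)


# Invented word forms, aligned by concept
words_a = ["pata", "piru", "tano", "tiku", "kasu", "kimo", "pona", "tela"]
words_b = ["bado", "filu", "dana", "tigo", "gaza", "xima", "buna", "dira"]
matches, p_value = permutation_test(words_a, words_b)
print(matches, p_value)
```

Because only class identity is compared, regular correspondences between sounds of different classes go unnoticed, which is one reason for the low power mentioned for this family of tests.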

Problems:

Wordlist compilation (specifically when doing this for proto-languages).
Gold standard testing (although more tests have been carried out than for correspondence-based approaches).
Control for borrowing and sound symbolism still unclear.
No semantic change.
Low power of the test compared to other approaches.

Outlook

Wishlist for future methods:

[A] methods that are based on open research paradigms, with open data, and open code, so that others can re-apply, newly apply, and build on them,
[B] methods that are exhaustive, being tested on more data, not just on a couple of languages in which the scholars are interested, and
[C] methods that are explicit on their limitations and provide exhaustive statistics on their success and failure on a gold standard.

When discussing language relatedness in the light of test theory and construct validity, there is, however, one obvious point where the parallel to "testing" as we know it from medicine and psychology does not hold.
When looking back at the history of Indo-European linguistics and historical linguistics, we do not see that people tested whether Sanskrit and Greek are related, but rather that they detected this relationship.
Scholars happened to compare the right traits and to find evidence as striking as video footage convicting a murderer. Often, scholars still think about relatedness in absolute terms, not in relative terms. In the future, we need to advance in both directions: we need to strengthen our heuristics and we need to strengthen our tests.

Thank you for your attention!