Sino-Tibetan Lexicostatistic Database


An Interactive System for Transparent Etymological Analysis


Agenda 2017

  • Introduction
  • Representation
  • Annotation
  • Sino-Tibetan Lexicostatistic Database

Introduction


Introduction

Lexicostatistic Databases

  • many lexicostatistical datasets and databases have been created recently, or are being created (IELex, ABVD, Chirila, GLD, etc.)
  • how the data is assembled and annotated often differs (user input, spreadsheets, text files, FileMaker)
  • data is often shared only through the online databases, while those creating the database rely on custom exports

Introduction

Problems

  • we lack established formats for data-sharing and data-curation
  • we lack standards and best practices for cognate annotation
  • criteria for cognacy are usually not transparently communicated
  • problems involving etymological relations going beyond cognacy (partial cognacy, oblique cognacy) are largely ignored
  • data annotation is not accompanied by consistency checks

Introduction

Requirements

  • standards for data-sharing
  • interfaces for fast and consistent data entry and annotation
  • increased detail in data annotation (no abstract cognate sets, no abstract orthographies)

Representation


Representation

The CLDF Initiative

The Cross-Linguistic Data Formats initiative (Forkel et al. 2016, http://cldf.clld.org) aims at increasing the comparability of cross-linguistic data. It comes with:

  • standardization efforts (linguistic databases like Glottolog and Concepticon),
  • software APIs which help to test whether data conforms to the standards (a minimal validation sketch follows below), and
  • working examples of best practice
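
As a minimal sketch of what such conformance testing can look like, the following assumes a CLDF wordlist accompanied by a metadata file and uses the pycldf library; the file path is illustrative:

    from pycldf import Wordlist

    # load a CLDF wordlist via its metadata file (path is illustrative)
    dataset = Wordlist.from_metadata('cldf/Wordlist-metadata.json')

    # check that the data conforms to the CLDF specification
    dataset.validate()

    # iterate over the word forms in the dataset
    for form in dataset['FormTable']:
        print(form['Language_ID'], form['Parameter_ID'], form['Form'])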

Representation

The CLDF Initiative

As of now, a couple of software tools (LingPy, Beastling, EDICTOR) support CLDF. In the future, we hope that the number of users will increase and that the community will help to develop the formats further.

Representation

The CLDF Initiative: Basics

Representation

Concepticon

The Concepticon is an attempt to link the large number of different concept lists used in the linguistic literature, ranging from Swadesh lists in historical linguistics to naming tests in clinical studies and psycholinguistics.

Representation

Concepticon

This resource, our Concepticon, links concept labels from different concept lists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts.

http://concepticon.clld.org
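
A minimal sketch of how these links can be queried programmatically, assuming a local clone of the concepticon-data repository and the pyconcepticon API (the path and attribute names follow that library's conventions, so treat them as illustrative):

    from pyconcepticon.api import Concepticon

    # point the API at a local clone of concepticon-data (path is illustrative)
    api = Concepticon('concepticon-data')

    # find the concept set glossed as "HAND" and print its definition
    hand = next(cs for cs in api.conceptsets.values() if cs.gloss == 'HAND')
    print(hand.id, hand.gloss, hand.definition)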

Representation

Concepticon

With version 1.1 of Concepticon, which will be released later in 2017, many new features become available:

  • increased data (about 200 different concept lists)
  • improved automatic linking of new concept lists (multi-lingual, data-driven, efficient)
  • improved methods for concept list comparison

Representation

STDB API

  • a Python API (https://github.com/digling/sinotibetan/) is used to manage and curate the data
  • data is stored in text-files in CLDF-compatible format
  • data is version-controlled via Git
  • specific commands allow one to check and analyse the data (a small coverage check is sketched after this list)
  • new data is entered by first linking it to Concepticon and then extracting the words for the 250 concepts of our basic concept list
  • the interaction with the Concepticon API and other specifically designed software libraries (LingPy, etc.) increases the transparency and efficiency of data entry (but it's still tedious!)
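
A small sketch of such a check, computing per-language coverage of the concept list with LingPy; the file name is illustrative:

    from lingpy import Wordlist

    # load the wordlist from a tab-separated file (name is illustrative)
    wl = Wordlist('sinotibetan.tsv')

    # coverage = share of the concepts attested for each language
    for language in wl.cols:
        concepts = wl.get_dict(col=language)  # maps concept -> word IDs
        print(language, '{0:.0%}'.format(len(concepts) / wl.height))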

Annotation


Annotation

Sound Segments

  • all words can be represented in original orthography, but additionally, they need to be given in phonetic transcription
  • transcriptions need to be segmented explicitly, since phonetic transcription is often ambiguous
  • by testing how well the data fits into the cross-linguistic phonetic notation framework, we automatically check its internal consistency (see the segmentation example below)
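
A minimal example of explicit segmentation with LingPy's ipa2tokens; the forms are illustrative:

    from lingpy import ipa2tokens

    # aspiration is merged with the preceding consonant, so "tʰ" counts
    # as one segment, not two; diphthongs are merged by default as well
    print(ipa2tokens('tʰoxtər'))  # ['tʰ', 'o', 'x', 't', 'ə', 'r']
    print(ipa2tokens('haus'))     # ['h', 'au', 's']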

Annotation

Cognate Sets

  • ideally, cognate sets should account for partial cognacy and for deep root-based cognacy
  • partial cognacy applies to compounds, root-based cognacy applies to cognate words which go back to different stems of the same root
  • in both cases, a clear cognate decision is difficult to make, and only a language-internal analysis will usually provide linguists with the information needed to resolve the problems
  • alternatively, one can annotate root cognacy, but additional information is needed in order to make cognate judgments transparent (see the schematic annotation below)
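
A schematic example of the annotation convention used by LingPy and EDICTOR for partial cognacy: morphemes in the segmented form are separated by "+", and the COGIDS column holds one cognate ID per morpheme. The doculects, forms, and IDs below are hypothetical (columns are tab-separated):

    ID   DOCULECT   CONCEPT    TOKENS           COGIDS
    1    DialectA   the moon   m a + l a ŋ      17 52
    2    DialectB   the moon   m e              17
    3    DialectC   the moon   l o ŋ + m i      52 17

Rows 1 and 2 thus count as partially cognate: they share the morpheme labeled 17, although a single full-word cognate decision would be hard to make.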

Annotation

Alignments

  • alignments are essential to represent transparently where cognate words are homologous
  • especially when dealing with diverse languages, it is not obvious to non-experts in the given language family which parts of two or more words are homologous
  • alignments are not only a service to the user but also help the annotator to keep a transparent view of their own proposals (see the sketch below)
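
A minimal sketch of multiple alignment in LingPy, using toy forms from the LingPy documentation rather than Sino-Tibetan data:

    from lingpy import Multiple

    # progressively align three toy word forms to make homologous
    # segments explicit, column by column
    msa = Multiple(['woldemort', 'waldemar', 'vladimir'])
    msa.prog_align()
    print(msa)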

Annotation

The EDICTOR Tool

The EDICTOR is a web-based tool that allows one to edit, analyse, and publish etymological data. It is available as a prototype in version 0.1 and will be further developed in the project "Computer-Assisted Language Comparison" (2017-2021). The tool can be accessed via the website at http://edictor.digling.org, or downloaded and used offline. All that is needed to use the tool is a web browser (Firefox, Safari, Chrome); offline usage is currently restricted to Firefox. The tool is file-based: the input is not a database structure, but a plain tab-separated text file (like a single sheet from a spreadsheet editor). The data formats are identical to those used by LingPy, thus allowing for a close interaction between automatic analysis and manual refinement.
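
A minimal example of such an input file (tab-separated; the doculects and forms are hypothetical):

    ID   DOCULECT   CONCEPT    IPA      TOKENS      COGID
    1    DialectA   the hand   lakʰa    l a kʰ a    1
    2    DialectB   the hand   lak      l a k       1
    3    DialectC   the hand   mit      m i t       2

Further columns, for example for alignments or partial cognate IDs, can be added as needed.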

Annotation

The EDICTOR Tool: Structure

The EDICTOR structure is modular, consisting of different panels that allow for:

  • data editing (data input, alignments, cognate identification)
  • data analysis (phonological analysis, correspondence analysis)
  • customisation

EDICTOR.DIGLING.ORG

Analysis


Analysis

Simple Stats

We have aggregated a larger set of different languages and concepts, but have now decided on a subset of:

  • 250 concepts, which have reasonably good coverage in alternative concept lists: 146 overlap with Blust's list, 113 with Matisoff's list, 148 with Comrie's list (= Swadesh 1952 and Swadesh 1955 combined)
  • 28 languages (to be expanded, hopefully soon)
  • 7080 words
  • minimal coverage: 80% (Hakha), average coverage: 96%

Analysis

Cognate Statistics

Analysis of cognate sets by Laurent and Guillaume has so far yielded the following statistics:

  • 4006 cognate sets (3159 singletons!)
  • "four" and "the name" cognate across all languages, followed by "to die", "five", "three" (27), "the ear" (26), "six" (25), "nine", "two" (24), "the blood", "the dream", "eight", "the eye", "the smoke", "thou" (23), "I", "to kill", "the pig" (22), "the fire" (21), "firewood" (20)
  • cognate data extremely sparse
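
A sketch of how such counts can be derived directly from the wordlist with LingPy, assuming cognate judgments are stored in a COGID column (the file name is illustrative):

    from lingpy import Wordlist

    wl = Wordlist('sinotibetan.tsv')    # illustrative file name
    etd = wl.get_etymdict(ref='cogid')  # cognate ID -> reflexes per language

    # the size of a cognate set is the number of word forms it contains
    sizes = [sum(len(refs) for refs in reflexes if refs)
             for reflexes in etd.values()]
    print(len(sizes), 'cognate sets')
    print(sum(1 for s in sizes if s == 1), 'singletons')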

Analysis

Phylogenies

  • initial tests with MrBayes show divergence dates between 6000 and 7000 years
  • the tree topology consistently shows Old Chinese as the first group to split off
  • if this turns out to be true, we can get back to calling the family "Sino-Tibetan" again (a schematic MrBayes setup follows below)
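
A schematic MrBayes setup for binary cognate data of this kind; taxon names, character strings, and settings are illustrative, not our actual analysis. Cognate sets are coded as presence/absence characters, the restriction datatype with noabsencesites coding corrects for the fact that all-absent cognate sets can never be observed, and a clock prior on branch lengths makes divergence dates interpretable:

    #NEXUS
    begin data;
      dimensions ntax=3 nchar=10;
      format datatype=restriction missing=? gap=-;
      matrix
        DialectA  1011001011
        DialectB  1010001110
        DialectC  0110110010
      ;
    end;

    begin mrbayes;
      lset coding=noabsencesites rates=gamma;
      prset brlenspr=clock:uniform;
      mcmc ngen=1000000 samplefreq=1000;
    end;

The uniform clock prior is only one possible choice; the outlook below calls for discussing such priors systematically.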

Outlook


Outlook

We need to

  • increase the number of languages in the sample (this will take some time; 50 seems a realistic goal for 2017),
  • add information on partial cognacy (this will increase the signal)
  • add aligned forms for the cognate sets (to conform to our transparency standards)
  • discuss clearly defined sets of priors for phylogenetic reconstruction

Thanks for Your Attention!
