Annotation of Sound Change and Phonetic Alignments in EDICTOR

Agenda 2021

Background
First Steps
Editing
Analyzing
Customizing
Examples
Outlook

Background

Computer-Assisted Language Comparison

Traditional methods of historical linguistics are based on manual data annotation.
With more data available, they reach their practical limits.
Computational methods cannot replace the experience and intuition of experts.

Background

Computer-Assisted Language Comparison

Since experts are slow, while computers are not very accurate, we need combined frameworks that reconcile classical and computational approaches.
Computer-assisted language comparison may drastically increase the consistency of expert annotation while correcting for the lack of accuracy in computational analyses.

Background

Data and Models

Data needs to be accessible in human- and machine-readable form.
Data needs to be normalized, as far as this is possible.
Models allow us to organize the data in a formal way.
Software allows us to test if the data conforms to our model.

Background

Languages, Words, and Concepts

Normally, people provide just a language name, assuming everybody knows what variety they talk about.
Normally, people provide just a simple elicitation gloss, assuming everybody knows what they mean.
Normally, people provide some kind of transcription or shortcut, assuming that everybody knows how the words should be pronounced anyway.

Background

Languages

Provide the language name in its original form, as you please (e.g., Mandarin (Běijīng)).
Provide the Glottocode in addition to your language name (e.g. beij1234).
Provide a unique identifier for the language which does not contain brackets, accents, or any other characters not in ASCII range (e.g., BeijingMandarin).
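
A minimal sketch of how such an ASCII-only identifier could be derived from a free-form language name. The function name ascii_identifier is hypothetical and not part of EDICTOR or LingPy. The word order of the result simply follows the input, so it may still need adjusting by hand.

```python
import re
import unicodedata


def ascii_identifier(name):
    """Turn a free-form language name into an ASCII-only identifier."""
    # Decompose accented characters into base letter plus combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop everything outside the ASCII range (this removes the combining marks).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep only letters and digits: brackets, spaces, and punctuation are removed.
    return re.sub(r"[^A-Za-z0-9]", "", ascii_only)


print(ascii_identifier("Mandarin (Běijīng)"))
```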

Background

Concepts

Use your original elicitation gloss, but make sure it is transparent (e.g., don't write rain, write the rain or rain (noun)).
Provide a link to the Concepticon project if possible (e.g., 658, RAIN (PRECIPITATION)); if you cannot find a matching concept, this is not a tragedy.
You can search for elicitation glosses at https://digling.org/calc/concepticon.

Background

Words

The word as you find it in a dictionary is your Value. Never modify it from the original source (e.g., ZaoMin "mother" is given as ni44; ʑe44 in your data, so ni44; ʑe44 is your Value).
If the entry has more than one word form, split them and list them separately, or list only one of them. This is your Form (ni44; ʑe44 thus becomes ni44 and ʑe44).
The Form is not yet computer-readable: it must be normalized to a transcription system that a machine can interpret. So you need to segment it and convert it to the standards of the Cross-Linguistic Transcription Systems (n i ⁴⁴ and ʑ e ⁴⁴).
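
The three steps from Value to Form to segments can be sketched as follows. This is a deliberately naive illustration that only handles the example above: it treats every character as one sound and every run of digits as one tone. The function names are hypothetical, and real data usually needs a proper segmentation routine.

```python
# Map plain digits to superscript digits for tone marking.
SUPERSCRIPT = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"


def value_to_forms(value):
    """Split a dictionary Value with several variants into single Forms."""
    return [form.strip() for form in value.split(";") if form.strip()]


def naive_segments(form):
    """Segment a Form: one character per sound, digit runs become one tone."""
    segments = []
    for char in form:
        if char in "0123456789":
            tone = char.translate(SUPERSCRIPT)
            if segments and segments[-1][-1] in SUPERSCRIPT_DIGITS:
                # Continue the tone segment started by the previous digit.
                segments[-1] += tone
            else:
                segments.append(tone)
        else:
            segments.append(char)
    return " ".join(segments)


for form in value_to_forms("ni44; ʑe44"):
    print(form, "->", naive_segments(form))
```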

Background

Wordlists

A wordlist is usually thought of as a list consisting of a different language in each column and a different concept in each row, with word forms being placed in each cell. This format is inflexible and highly impractical.
Our wordlist consists of a header, which indicates the values of the cells in each column, with the first column being reserved for a numeric ID.
This is also known as "long table format", and all linguists should use it and abandon their traditional wordlist format.
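
A minimal sketch of the conversion from the traditional layout to long-table format. The column name FORM, the function name, and the placeholder languages and forms are assumptions made for illustration only; they are not real data.

```python
def wide_to_long(concepts, wide):
    """Convert a concept-by-language table into long-table rows.

    `wide` maps each language to a list of forms, one per concept.
    """
    rows = [["ID", "DOCULECT", "CONCEPT", "FORM"]]
    idx = 1
    for language, forms in wide.items():
        for concept, form in zip(concepts, forms):
            if form:  # empty cells do not produce a row
                rows.append([str(idx), language, concept, form])
                idx += 1
    return rows


# Placeholder data, not taken from any real source.
concepts = ["mother", "the rain"]
wide = {"LanguageA": ["form1", "form2"], "LanguageB": ["form3", ""]}

for row in wide_to_long(concepts, wide):
    print("\t".join(row))
```

Each word form ends up in its own row with its own numeric ID, so further columns (cognate sets, alignments, notes) can be added without changing the layout.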

Background

Wordlists

Typical linguistic data as illustrated in Wu et al. (2020).

Background

Wordlists

Long-table format, required for the EDICTOR as illustrated in Wu et al. (2020).

Background

EDICTOR

The EDICTOR (List 2017) is a web-based tool for editing, analysing, and publishing etymological data. It is available in Version 1.0 and can be accessed at https://digling.org/edictor/. All that is needed to use the tool is a web browser (Firefox, Safari, Chrome). Offline usage is also possible, but requires changing the browser settings. The tool is file-based: the input is not a database structure but a plain tab-separated text file (such as a single sheet exported from a spreadsheet editor in long-table format). The data formats are identical to those used by LingPy, which allows for close interaction between automatic analysis and manual refinement.

First Steps

Where is the EDICTOR?

Official Version: https://digling.org/edictor
Official Version for Team Projects: https://lingulist.de/edictor
Development Version: https://lingulist.de/edev/

First Steps

The Basic Structure of the EDICTOR

The EDICTOR structure is modular, consisting of different panels that allow for:

data editing (data input, alignments, cognate identification)
data analysis (phonological analysis, correspondence analysis)
customisation

First Steps

File Formats

File formats are straightforward:

tab-separated values (easy to export from Excel and LibreOffice)
one row corresponds to one word form, each word form needs a numerical ID
specified headers for standardized columns (DOCULECT, CONCEPT, TRANSCRIPTION, basically customisable)
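
Before loading a file, the points above can be checked with a few lines of Python. This is a sketch under assumptions: the function name check_wordlist and the choice of DOCULECT and CONCEPT as required columns are mine, and the checks are not what the EDICTOR itself runs.

```python
import csv
import io

REQUIRED = ("DOCULECT", "CONCEPT")  # assumed minimal set of columns


def check_wordlist(text):
    """Return a list of problems found in tab-separated wordlist text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # ignore empty lines
    header, body = rows[0], rows[1:]
    problems = []
    if header[0] != "ID":
        problems.append("first column must be ID")
    for column in REQUIRED:
        if column not in header:
            problems.append("missing column " + column)
    seen = set()
    for number, row in enumerate(body, start=2):
        if not row[0].isdigit():
            problems.append("line %d: ID is not numeric" % number)
        elif row[0] in seen:
            problems.append("line %d: duplicate ID %s" % (number, row[0]))
        seen.add(row[0])
    return problems


example = (
    "ID\tDOCULECT\tCONCEPT\tTRANSCRIPTION\n"
    "1\tZaoMin\tmother\tni44\n"
    "2\tZaoMin\tmother\tʑe44\n"
)
print(check_wordlist(example))
```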

First Steps

File Input and Output

files are "uploaded" to the EDICTOR using the JavaScript file API (this is no real upload, as it is client-side, so no data will be transferred, but all data remains on the computer of the user)
files are saved in local storage (will be lost if one closes the web-browser)
files are downloaded to the computer (again, no real download from a remote server, but the act of saving data on the client computer in a file, using the new file API in HTML 5)

DEMO

Editing

Navigation

navigation across the panels is straightforward by selecting which panels to show from a basic menu
navigation within the panels follows specific rules which may not be completely intuitive when testing the tool for the first time
all panels have a help-tag that can be used to see the main features of a given panel

DEMO

Editing

Editing in the Wordlist Panel

new rows can be added to a given input file, but this is not encouraged; instead, it is recommended to prepare a template containing the concepts which one wants to translate into the target languages
rudimentary support for the creation of customised questionnaires is provided (more support will be provided with the next official Concepticon release)
SAMPA input for phonetic transcriptions is supported
automatic segmentation of phonetic transcriptions is supported
Pīnyīn input for Chinese characters is supported but not reliable yet

DEMO

Editing

Cognate Annotation (Cognate Panel)

Cognates can be assigned to words using two rudimentary operations:

NEW: assign a new cognate set ID to the selected words
COMBINE: combine the cognate set IDs of the selected words

The Cognates panel is synchronized with the wordlist panel: when assigning cognates within meaning slots, the wordlist panel is automatically filtered to show only the words under consideration.

The Cognates panel also allows users to directly align the words which were assigned to the same cognate set.
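
The two operations can be pictured as updates on a mapping from word IDs to cognate set IDs. The sketch below is one plausible reading, not the EDICTOR's actual code: the function names are hypothetical, and it is an assumption that COMBINE merges the complete cognate sets touched by the selection (keeping the smallest ID).

```python
def new_cognate_set(cogids, selected):
    """NEW: give all selected words one fresh cognate set ID."""
    fresh = max(cogids.values(), default=0) + 1
    for word_id in selected:
        cogids[word_id] = fresh
    return fresh


def combine_cognate_sets(cogids, selected):
    """COMBINE: merge every cognate set touched by the selection."""
    touched = {cogids[word_id] for word_id in selected}
    target = min(touched)
    for word_id, cogid in cogids.items():
        if cogid in touched:
            cogids[word_id] = target
    return target


# Word IDs mapped to cognate set IDs (placeholder values).
cogids = {1: 1, 2: 2, 3: 2, 4: 3}
new_cognate_set(cogids, [4])          # word 4 gets a fresh ID
combine_cognate_sets(cogids, [1, 2])  # sets 1 and 2 are merged
print(cogids)
```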

DEMO

Editing

Partial Cognate Annotation (Partial Cognates Panel)

Partial cognates can be assigned to words provided that the data is segmented into morphemes. The assignment follows a very intuitive schema:

from a bunch of words, morphemes are selected and then assigned to a new partial cognate ID
words marked as cognate are shown in the same column, where they can be aligned
clicking on the words in the column deletes them from the partial cognate sets

DEMO

Editing

Morpheme Annotation (Morpheme Glosses Panel)

Morpheme Glosses

the morphemes of a given word are annotated in a way similar to glossed text in syntax annotations
spaces serve as a separator between morpheme glosses, so morpheme glosses are not allowed to have spaces
apart from that, the format is free, but identical strings in different words will be interpreted as actually cognate morphemes inside the language
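
How identical gloss strings link morphemes inside one language can be sketched as follows. The function name and the English compounds used as input are illustrative assumptions; the sketch only shows the bookkeeping, not the panel's implementation.

```python
from collections import defaultdict


def internal_cognates(glossed_words):
    """Map each morpheme gloss to the words (and positions) that carry it."""
    index = defaultdict(list)
    for word, glosses in glossed_words.items():
        # Spaces separate the glosses, one gloss per morpheme.
        for position, gloss in enumerate(glosses.split()):
            index[gloss].append((word, position))
    # Only glosses shared by at least two morphemes link words to each other.
    return {gloss: hits for gloss, hits in index.items() if len(hits) > 1}


# Illustrative input: words of one language with their morpheme glosses.
glossed = {"sunrise": "SUN RISE", "sunset": "SUN SET", "sun": "SUN"}
print(internal_cognates(glossed))
```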

Editing

Morpheme Annotation (Morpheme Glosses Panel)

Morpheme Glosses Panel (new in EDICTOR 2.0)

allows users to annotate morphemes by assigning cross-semantic, language-internal cognates
allows for quick segmentation of words into morphemes
allows users to search for language-internal cognates
is crucial for the integration of vertical and horizontal language comparison

DEMO

Analysis

Phonology Analysis (Phonology Panel)

In order to check how well a given transcription was carried out, the Phonology panel can be used. It lists all phonemes for a given language (proper segmentation is required) and their frequency of occurrence. The expert can thus check the correctness of very rare phonemes or weird characters. Since the Phonology panel links to the Wordlist panel, experts can quickly find the words containing specific phonemes and correct them or inspect them. In addition, an IPA chart can be displayed to check the structural properties of the sound system of a given language.
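
The frequency count behind such a check can be sketched in a few lines. The function name and the row layout (language, space-segmented form) are assumptions for illustration; the two forms are the ZaoMin examples used earlier.

```python
from collections import Counter


def segment_frequencies(rows, doculect):
    """Count how often each segment occurs in the words of one language."""
    counts = Counter()
    for language, segments in rows:
        if language == doculect:
            counts.update(segments.split())
    return counts


rows = [("ZaoMin", "n i ⁴⁴"), ("ZaoMin", "ʑ e ⁴⁴")]
counts = segment_frequencies(rows, "ZaoMin")

# Most frequent segments first; very rare ones are candidates for a closer look.
for sound, count in counts.most_common():
    print(sound, count)
```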

DEMO

Analysis

Morphological Analysis (Morphology Panel)

Quasi-Deprecated and Superseded by the Morpheme Glosses Panel

This panel was an early attempt to handle morpheme glosses. It offers some potentially interesting features, such as bipartite word family graphs, but will otherwise not help you much to advance your data, so I recommend ignoring it.

DEMO

Analysis

Sound Correspondences (Correspondences Panel)

Quasi-Deprecated and Superseded by the Correspondence Patterns Panel

If alignments are provided for a given dataset, one can use the Correspondences Panel of the EDICTOR to compare the frequency of sound correspondences between language pairs. In this way, errors in cognate assignment or alignment analyses can be quickly corrected and a general idea regarding regular sound correspondences can be derived. Sound correspondences can also be defined for a given context. This context needs to be submitted by the user as additional data in an additional column, or it can be computed automatically, based on the idea of prosodic strings (List 2014), which assign each sound a value based on its prosodic weight.
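
Counting sound correspondences between language pairs from aligned cognate sets can be sketched as follows. The data layout (one dictionary per cognate set, mapping a language to its aligned segments) and the toy alignments are assumptions for illustration, not real data and not the EDICTOR's internal format.

```python
from collections import Counter
from itertools import combinations


def pairwise_correspondences(cognate_sets):
    """Count aligned sound pairs for every pair of languages.

    Each cognate set maps a language to its aligned segments; all
    alignments in one set have the same length, with "-" for gaps.
    """
    counts = Counter()
    for alignment in cognate_sets:
        for lang_a, lang_b in combinations(sorted(alignment), 2):
            for sound_a, sound_b in zip(alignment[lang_a], alignment[lang_b]):
                counts[(lang_a, lang_b, sound_a, sound_b)] += 1
    return counts


# Toy alignments with placeholder languages A and B.
cognate_sets = [
    {"A": ["p", "a"], "B": ["f", "a"]},
    {"A": ["p", "i"], "B": ["f", "i"]},
]
for key, count in pairwise_correspondences(cognate_sets).most_common():
    print(key, count)
```

A pair that occurs only once where a frequent pair would be expected is a good place to look for an error in cognate assignment or alignment.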

DEMO

Analysis

Correspondence Patterns (Correspondence Patterns Panel)

You can also inspect correspondence patterns across all the languages in your sample, provided, again, that you have fully aligned your data. A first method for correspondence pattern identification was proposed by List (2019). Because the algorithm is time-consuming, it is only available in Python: you first analyze your data with it and then load the resulting file into the EDICTOR for inspection. The EDICTOR itself offers a very simple greedy solution for quick data inspection, which usually shows the most frequent patterns in your data.
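
To give an idea of what a greedy grouping of alignment columns can look like, here is a small stand-in. It is neither the EDICTOR's implementation nor the algorithm of List (2019): the missing-data marker Ø, the function names, and the ordering heuristic are all assumptions. Two columns count as compatible if they never show different sounds for the same language.

```python
MISSING = "Ø"  # assumed marker for a language without a reflex


def compatible(pattern, column):
    """Columns are compatible if no language shows two different sounds."""
    return all(a == b or MISSING in (a, b) for a, b in zip(pattern, column))


def merge(pattern, column):
    """Fill gaps in the pattern with sounds attested in the column."""
    return tuple(b if a == MISSING else a for a, b in zip(pattern, column))


def greedy_patterns(columns):
    """Group alignment columns into correspondence patterns, greedily."""
    patterns = []  # each entry: [merged pattern, member columns]
    # Columns with fewer gaps come first, since they are most informative.
    for column in sorted(columns, key=lambda c: c.count(MISSING)):
        for entry in patterns:
            if compatible(entry[0], column):
                entry[0] = merge(entry[0], column)
                entry[1].append(column)
                break
        else:
            patterns.append([tuple(column), [column]])
    # Most frequent patterns first.
    return sorted(patterns, key=lambda entry: len(entry[1]), reverse=True)


# Toy columns for three placeholder languages.
columns = [("p", "f", "p"), ("p", "f", "Ø"), ("p", "p", "p"), ("Ø", "f", "p")]
for pattern, members in greedy_patterns(columns):
    print(pattern, len(members))
```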

DEMO

Analysis

Cognate Sets (Cognate Sets Panel)

Inspecting how the cognate sets are distributed in your data is very useful for getting a direct impression of certain aspects of subgrouping. You can easily do this with the help of the Cognate Sets panel of the EDICTOR. In addition, you can export your data to the Nexus format from here and use the file in biological software packages, such as SplitsTree, to infer a quick tree or network.
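
For orientation, a presence/absence matrix of cognate sets in Nexus format can be produced along the following lines. This is a simplified sketch with a hypothetical function name and placeholder languages, not the EDICTOR's exporter: in particular, it codes every unattested cognate set as 0, whereas a fuller version would mark missing concepts with ?.

```python
def cognates_to_nexus(cognates):
    """Write a binary cognate matrix in Nexus format.

    `cognates` maps each language to the set of cognate set IDs it attests.
    """
    languages = sorted(cognates)
    all_sets = sorted(set().union(*cognates.values()))
    width = max(len(name) for name in languages)
    lines = [
        "#NEXUS",
        "",
        "BEGIN DATA;",
        "DIMENSIONS NTAX=%d NCHAR=%d;" % (len(languages), len(all_sets)),
        'FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;',
        "MATRIX",
    ]
    for name in languages:
        # One character per cognate set: 1 if attested, 0 otherwise.
        states = "".join("1" if c in cognates[name] else "0" for c in all_sets)
        lines.append(name.ljust(width + 2) + states)
    lines += [";", "END;"]
    return "\n".join(lines)


# Placeholder languages and cognate set IDs.
print(cognates_to_nexus({"LanguageA": {1, 2}, "LanguageB": {1, 3}, "LanguageC": {2, 3}}))
```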

DEMO

Customization

Templates

Templates can be used to develop first questionnaires that can then be filled out with the help of the EDICTOR. Template functionality is still rudimentary in the EDICTOR. Users can select among different concept lists (Swadesh, Blust, etc.) and also merge multiple concept lists. More fine-grained operations (the intersection of concept lists, or mergers which take concept similarity into account) are not yet implemented, but are currently being developed for the Concepticon, where they will be available with the next official release (planned for 2017).

DEMO

Customization

Database Interface

The EDICTOR has a database backend that allows data to be stored automatically on a server. To support this, databases need to be created explicitly, and there is no official way to do this at the moment: users who wish to use the database backend need to ask me to set up a database and create passwords for them. The interface for databases in the Customize menu allows users to select those items of a given database which they want to inspect. These are stored in a link that users can then bookmark and use whenever they want to work on the data.

DEMO

Customization

Custom Settings

Users can customize a large part of the EDICTOR's default settings. This is done with the help of specific URLs that users can bookmark to open the tool in their preferred view. For example, if one wants to see the alignments immediately when loading a file, this can be specified. If users prefer to use their own IPA keyboard instead of SAMPA conversion, this is also possible. More possibilities for customization will be added in the future, but the EDICTOR already offers a great deal of flexibility.

DEMO

Customization

Publishing Data

The EDICTOR can also be used as a convenient interface to publish data in a nice form on the web. As it is purely text-based, all that is needed is to clone the EDICTOR software and host it on a server of one's choice. Then, by using customized URLs one can present a given dataset in read-only mode. In this way, users cannot edit the data, but they can use the interactive possibilities to inspect it.

Handling Sound Change

Regular and Irregular Cognates

the handling of sound change requires that only regular cognates be annotated, since only these can be aligned
irregular cognates can be ignored or set aside by adding an additional ROOTIDS column, which collects overarching roots that are not regular so that they can later be explained systematically

Handling Sound Change

Clustering Sounds into Evolving Units

sometimes it is useful to group sounds into one unit, although they are distinct sounds
this allows us to circumvent the problem of conditioning context (the group is the context)
an experimental feature in EDICTOR allows us to group sounds by replacing the space by a dot .
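
How the dot changes the reading of a segmented form can be sketched as follows. The function name and the example form are hypothetical; the sketch only illustrates the notation, not the EDICTOR's internal handling.

```python
def evolving_units(segments):
    """Split a segmented form into units; dots join sounds into one unit."""
    return [unit.split(".") for unit in segments.split()]


# Placeholder form: the two vowels are grouped into a single unit.
units = evolving_units("t a.i ⁵⁵")
print(units)       # one inner list per unit
print(len(units))  # number of units, i.e. alignment columns
```

With the space, a and i would occupy two alignment columns; with the dot, they are aligned as one unit.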

DEMO

Handling Sound Change

Extending the Semantics of the Slash Construct

the slash-notation so far can be used to separate some source sound from a target sound
this can be used to represent a sound for EDICTOR as a vowel like [ə] while preserving original orthography before the slash h₂/ə
but we can also use the construct to mark the attested sound and contrast it with the expected sound, thus using a Grimm-like style to annotate our exceptions in our data without discarding them
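
Reading the slash construct amounts to splitting a token into the part before and the part after the slash. The helper names below are hypothetical, and the sketch deliberately does not decide which side holds the attested and which the expected sound, since that convention is up to the annotator.

```python
def parse_slash(token):
    """Return (before, after) for a token; without a slash both are equal."""
    before, slash, after = token.partition("/")
    return (before, after) if slash else (token, token)


def annotated_tokens(segments):
    """List the tokens of a segmented form that use the slash construct."""
    return [token for token in segments.split() if "/" in token]


print(parse_slash("h₂/ə"))
print(parse_slash("a"))
print(annotated_tokens("h₂/ə t a"))
```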

DEMO

Handling Sound Change

Proto-Languages and Correspondence Patterns

to be able to browse through correspondence patterns and investigate them, you first need to make a very simple reconstruction of a larger number of cognate sets
add a dummy proto-language to your dataset
start to go through alignments and insert words for the dummy proto-language and align the data with the cognate sets

Handling Sound Change

Proto-Languages and Correspondence Patterns

select the proto-language in the settings, so you can search for it in the Correspondence Patterns Panel
when investigating correspondence patterns, try to make all patterns match by marking irregular sounds with the slash construct, by merging sounds, and by checking which sound changes can be predicted easily from context

DEMO

Thanks to everyone!