A Computer-Assisted Approach
Given an alphabet (a non-empty finite set whose elements are called characters), a sequence is an ordered list of characters drawn from the alphabet. The elements of sequences are called segments. The length of a sequence is the number of its segments, and the cardinality of a sequence is the number of its unique segments (cf. Böckenhauer and Bongartz 2003: 30f).
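To make these definitions concrete, here is a small Python illustration (a hypothetical example, using the segmented form of German Harald that appears later in this tutorial):
>>> word = ['h', 'a', 'r', 'a', 'l', 't']
>>> len(word)        # length: the number of segments
6
>>> len(set(word))   # cardinality: the number of unique segments ('a' occurs twice)
5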
Sequence comparison plays a crucial role in historical linguistics. Despite this role, however, scholars working in an exclusively qualitative paradigm rarely consider the importance of segmentation and alignments in their work, and although they use alignments implicitly, they often neglect them when asked to make them explicit.
Automatic sequence comparison cannot yet compete with trained linguists. Linguists, however, are human, which means that they necessarily make errors sooner or later (before or after lunch). Using computational techniques for automatic cognate detection may reveal these inconsistencies and help annotators to improve their data.
We do not have annotated data for all languages in the world. Before starting to collect them, we may want to explore data for which no cognate judgments exist. This can help us to:
If data can be processed by a computer, it can also be checked. Once your data is machine-readable, you can do many interesting things beyond searching for cognates, such as:
If automatic cognates can be computed, automatic alignments can be computed as well, and alignments have a huge advantage over a simple listing of cognate sets, since they make even human judgments much more transparent.
If we want to bring historical linguistics into the quantitative and empirical era, we need to increase the transparency of our judgments and analyses. By producing data which is not only human- but also machine-readable, we can not only use it to carry out automatic tests, but we will generally help to put the comparative method on more solid ground.
Specific formats are needed to analyse linguistic data with LingPy and search automatically for cognates. The basic format used in LingPy is a tab-separated input file in which the first row serves as a header and defines the content of the remaining rows. The very first column is reserved for numerical identifiers, while the order of the other columns is arbitrary. To a large degree, this input format is compatible with the one advocated by the CLDF initiative.
The Cross-Linguistic Data Formats initiative (Forkel et al. 2016, http://cldf.clld.org) aims at increasing the comparability of cross-linguistic data. It comes along with:
ID    CONCEPT    COUNTERPART    IPA     DOCULECT    COGID
1     hand       Hand           hant    German      1
2     hand       hand           hænd    English     1
3     hand       рука           ruka    Russian     2
...   ...        ...            ...     ...         ...
(Careful, this data could have been charmed!)
ID    CONCEPT    COUNTERPART    TOKENS     DOCULECT    COGID
# this is a field used for comments
1     hand       Hand           h a n t    German      1
2     hand       hand           h æ n d    English     1
3     hand       рука           r u k a    Russian     2
...   ...        ...            ...        ...         ...
ID    CONCEPT    DOCULECT    COGID    TOKENS         ALIGNMENT
1     Harry      German      1        h a r a l t    h a r a l t
2     Harry      English     1        h æ r i        h æ r i - -
3     Harry      Russian     1        g a r i        g a r i - -
...   ...        ...         ...      ...            ...
LingPy offers a generic Wordlist class to represent and manipulate word lists. From the Python terminal (e.g., IPython), you can easily read a wordlist and then either query or modify its content:
>>> from lingpy import *
>>> wl = Wordlist('polynesian.tsv')
>>> print("Concepts:", wl.height, "Languages:", wl.width)
Concepts: 210 Languages: 37
Wordlists allow you to slice data in different ways. Internally, they represent data from the perspective of the "row" (the concept) and the "column" (the language). To query which words (represented by their indices) occur for the concept "hand/arm", you can thus write:
>>> wl.get_list(row="HAND", flat=True)
[2050, ...]
Similarly, to query all words attested for a given language (column), you can write:
>>> wl.get_list(col="Anuta", flat=True)
[2179, ...]
More information can be found at http://lingpy.org/tutorial/lingpy.basic.wordlist.html.
The most crucial aspect is that you can retrieve every word in the data with the help of its integer key. Using only the key, you will get the full row of a wordlist object, but if you add a column name, you will get the value of the respective cell:
>>> wl[2050]
['abvd-abvd-253-1', 'anut1237', '1277', 'Anuta', 'HAND', 'rima',
['r', 'i', 'm', 'a']]
>>> wl[2050, 'tokens']
['r', 'i', 'm', 'a']
LingPy can automatically segment the phonetic data provided by the user. When LingPy compares sequences, they are generally first converted to an internal format which reduces their complexity to a smaller alphabet of sound classes.
>>> ipa2tokens('tʰɔxtɐ')
['tʰ', 'ɔ', 'x', 't', 'ɐ']
>>> tokens2class(['tʰ', 'ɔ', 'x', 't', 'ɐ'], 'sca')
['T', 'U', 'G', 'T', 'E']
If LingPy does not correctly recognize a symbol (e.g., because it is erroneous), it represents it internally as a "0". We can use this to check the quality of the input data.
>>> tokens2class(['tʰ', '?', 'x', 't', 'ɐ'], 'sca')
['T', '0', 'G', 'T', 'E']
It is crucial to check that all data has been converted properly and that LingPy "knows" all symbols. Otherwise, many strange things can happen in the analysis, and the results will be disappointing.
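As a minimal sketch of such a check, we can scan the whole word list for entries that contain unrecognized symbols, reusing only the methods shown above (the column name "tokens" follows the earlier examples):
>>> for concept in wl.rows:
...     for idx in wl.get_list(row=concept, flat=True):
...         tokens = wl[idx, 'tokens']
...         if '0' in tokens2class(tokens, 'sca'):
...             print(idx, tokens)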
Coverage (how many data points there are per concept and language in a given dataset) is crucial for the success of the more advanced analyses provided by LingPy (LexStat, LexStat with Infomap clustering). If coverage falls below a certain level (rule of thumb: fewer than 100 word pairs for two languages), no sound-correspondence signal can be found, and LingPy won't find cognates even in words which look obviously similar.
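A rough way to check this is to count the concepts shared by every pair of languages, as in the following sketch (the threshold of 100 is the rule of thumb just mentioned):
>>> from itertools import combinations
>>> concepts = {lang: set(wl.get_dict(col=lang)) for lang in wl.cols}
>>> for lang1, lang2 in combinations(wl.cols, 2):
...     shared = len(concepts[lang1] & concepts[lang2])
...     if shared < 100:
...         print(lang1, lang2, shared)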
Two major aspects of sequence comparison are cognate detection and phonetic alignment.
Both can be carried out automatically in LingPy and visualized and inspected with the help of the EDICTOR tool. Both analyses are based on rather complex workflows, and a good knowledge of the relevant papers is required to understand all parameters that can be used when applying the algorithms. All algorithms are described in detail in List (2014). It is recommended to be careful when modifying the parameters and to inform the LingPy core developer team (http://github.com/lingpy/lingpy) in case of strange results.
There are three major phases in LexStat:
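A typical LexStat run looks roughly as follows (a sketch, assuming the polynesian.tsv file from above; the number of permutation runs and the threshold of 0.6 are common choices, not prescribed values):
>>> lex = LexStat('polynesian.tsv')
>>> lex.get_scorer(runs=1000)   # permutation-based scoring schemes per language pair
>>> lex.cluster(method='lexstat', threshold=0.6, ref='cogid')
>>> lex.output('tsv', filename='polynesian-lexstat')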
The tutorial is available on GitHub at https://github.com/shh-dlce/qmss-2017 (folder LingPy).
We inspect the results with the help of the EDICTOR tool at http://edictor.digling.org (List 2017).
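To align the detected cognate sets and write a file that EDICTOR can load, one can use LingPy's Alignments class (a sketch; the input file name follows the LexStat example above, and 'progressive' is one of several available methods):
>>> alm = Alignments('polynesian-lexstat.tsv', ref='cogid')
>>> alm.align(method='progressive')
>>> alm.output('tsv', filename='polynesian-aligned')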
By exporting the data to NEXUS or to the Phylip distance format, we can further load the results of the analysis into software packages like SplitsTree or BEAST, which allow for more fine-grained and detailed phylogenetic analyses.
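As a sketch of such an export, again assuming the LexStat analysis from above with cognate sets stored in 'cogid' (the exact output parameters are assumptions based on common usage):
>>> lex.output('paps.nex', filename='polynesian', ref='cogid', missing='?')
>>> lex.calculate('dst', ref='cogid')   # pairwise distances from shared cognates
>>> lex.output('dst', filename='polynesian')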
Thanks for Your Attention!