3 Problems and Possible Solutions
Automatic Morpheme Segmentation
The task:
Given a list of fewer than 1,000 words in phonetic transcription, already segmented into sounds, with concepts mapped to common concept lists (e.g., Concepticon), identify the morpheme boundaries in the data.
Current solutions:
- most algorithms build on n-grams (recurring symbol sequences of arbitrary length): assuming that n-grams representing meaning-building units recur more frequently across the lexicon of a language, they collect n-gram statistics from the data
- with Morfessor, a popular family of algorithms is available in the form of a stable library (Creutz and Lagus 2005, Virpioja et al. 2013)
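The n-gram idea can be illustrated with a minimal sketch. It only collects the raw statistics (in how many words each n-gram occurs) and is not the model used by Morfessor, which is more elaborate. The function name `ngram_counts` and the toy word list are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(words, max_n=3):
    """Count in how many distinct words each n-gram (n = 1..max_n) occurs."""
    counts = Counter()
    for word in words:
        seen = set()  # count every n-gram only once per word
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                seen.add(tuple(word[i:i + n]))
        counts.update(seen)
    return counts


# Hypothetical toy data: words already segmented into sounds.
words = [
    ["h", "a", "n", "d"],
    ["h", "a", "n", "d", "s"],
    ["l", "a", "n", "d"],
    ["l", "a", "n", "d", "s"],
]
counts = ngram_counts(words)
print(counts[("a", "n", "d")])  # 4
print(counts[("h", "a", "n")])  # 2
```

With so few words, many n-grams that are not morphemes recur just as often as those that are, which hints at why purely frequency-based approaches struggle on small word lists.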
How do current solutions perform?
- testing the performance with the online Morfessor demo, we can see that the results are often disappointing
- even when trained on large amounts of data, the algorithms do not seem to reach high accuracy
- when trained on fewer than 1,000 words, they fail gloriously
Why is the task so difficult?
- morphemes are ambiguous: they are based not only on form, but also on semantics
- even speakers may at times no longer understand the original morphology of their language (folk etymology, etc.)
What do humans do to find morphemes?
- humans take semantics into account (e.g., compare Spanish hermano "brother" with hermana "sister")
- humans know that morphological structure varies across languages (compare SEA languages vs. Indo-European languages)
- humans try to infer phonotactic rules
- humans make use of cross-linguistic evidence
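The first point, using semantics to spot shared material, can be sketched as follows. Given two words known to be semantically related, the sketch proposes a boundary after their longest shared prefix. The helper names `shared_prefix` and `propose_boundary` and the `min_len` parameter are assumptions for illustration, and orthographic letters stand in for phonetic segments here.

```python
def shared_prefix(a, b):
    """Return the longest common prefix of two sound sequences."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def propose_boundary(a, b, min_len=2):
    """Split two semantically related words after their shared prefix.

    Returns None when the shared prefix is shorter than min_len,
    since very short overlaps are likely to be accidental.
    """
    k = len(shared_prefix(a, b))
    if k < min_len:
        return None
    return (a[:k], a[k:]), (b[:k], b[k:])


# Spanish example from the slide, in orthography rather than transcription.
brother = list("hermano")
sister = list("hermana")
result = propose_boundary(brother, sister)
print(result)
# ((['h', 'e', 'r', 'm', 'a', 'n'], ['o']), (['h', 'e', 'r', 'm', 'a', 'n'], ['a']))
```

The semantic link between the two concepts is what licenses the comparison; without it, any two words sharing a prefix would be split.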
Suggestions for solutions:
- employ semantic information (make use of resources, such as CLICS, Concepticon, etc.)
- employ phonotactic information (make use of the prosody models in LingPy)
- employ cross-linguistic information (use LingPy's sequence comparison techniques)
- give up the idea of a universal morpheme segmentation algorithm (rather proceed from linguistic areas)
- invest time to create datasets for testing and training
Automatic Contact Inference
The task:
Given word lists of different languages, find out which words have been borrowed, and also determine the direction of borrowing.
Current solutions:
- identify conflicts in the phylogeny and explain them by invoking borrowings (MLN approach, Nelson-Sathi et al. 2011, List et al. 2014)
- search for similar words among unrelated languages (Mennecier et al. 2016)
- tree reconciliation methods (Willems et al. 2016)
- borrowability statistics (Sergey Yakhontov, as reported by Starostin 1990, Chén 1996, McMahon et al. 2005)
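The second approach, searching for similar words among unrelated languages, can be sketched with a plain normalized edit distance. This is a simplified stand-in, not the procedure of any of the cited studies; the function names, the threshold of 0.4, and the word forms for the two placeholder languages A and B are all invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sound sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def borrowing_candidates(wordlist, threshold=0.4):
    """Flag similar words for the same concept.

    wordlist maps concept -> {language: sounds}; the languages are
    assumed to be unrelated, so similarity cannot be due to inheritance.
    """
    hits = []
    for concept, forms in wordlist.items():
        langs = sorted(forms)
        for i, lang_a in enumerate(langs):
            for lang_b in langs[i + 1:]:
                d = normalized_distance(forms[lang_a], forms[lang_b])
                if d <= threshold:
                    hits.append((concept, lang_a, lang_b, round(d, 2)))
    return hits


# Hypothetical forms for two unrelated placeholder languages.
wordlist = {
    "tea": {"A": ["tʃ", "a", "j"], "B": ["tʃ", "a", "i"]},
    "water": {"A": ["s", "u"], "B": ["m", "i", "z"]},
}
hits = borrowing_candidates(wordlist)
print(hits)  # [('tea', 'A', 'B', 0.33)]
```

A hit is only a candidate: the sketch cannot tell borrowing from chance resemblance, and it says nothing about the direction of borrowing.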
How do current solutions perform?
- approaches based on conflicts in the phylogeny tend to overestimate the amount of borrowing, since there are multiple reasons for conflicts in phylogenies, not only borrowing (Morrison 2011)
- sequence comparison on unrelated languages seems solid, but one needs to be careful with chance resemblances based on onomatopoetic words and the like (mama, papa, etc., Jakobson 1960)
- tree reconciliation methods are unrealistic if word trees are derived from simple edit distances
- sublist approaches may be useful, but they require large collections of known borrowings, which we usually lack
Why is the task so difficult?
- detecting borrowing presupposes excluding alternative explanations (inheritance, natural patterns, chance)
- there is no unified procedure for the identification of borrowings in the classical discipline
- borrowing detection relies much more heavily on multiple types of evidence than other disciplines do
What do humans do to find borrowings?
- search for phylogenetic conflicts (English mountain, French montagne)
- search for trait-related conflicts (German Damm, English dam)
- areal proximity (as a pre-condition)
- borrowability (in cases of doubt)
Suggestions for solutions:
- increase the amount of cross-linguistic data in phonetic transcription with consistently defined meanings, to allow searching for similar words among unrelated languages
- test methods for automatic correspondence pattern recognition and search for trait-related conflicts (List 2019)
- work on cross-linguistic datasets of known borrowed words to increase our knowledge of borrowability
Automatic Sound Law Induction
The task:
Given a list of words in an ancestral language and their reflexes in a descendant language, identify the sound laws by which the ancestor can be converted into the descendant.
Current solutions:
- simulation studies (black boxes, see e.g., Ciobanu and Dinu 2018) for word prediction
- manual tools to model sound change when providing sound laws (PHONO, Hartmann 2003)
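What such manual tools do in principle, applying user-supplied sound laws in order, can be sketched with regular expressions. This is not the interface of PHONO; the function `apply_laws`, the two laws, and the input forms are invented for illustration.

```python
import re


def apply_laws(word, laws):
    """Apply an ordered list of (pattern, replacement) sound laws to a word."""
    for pattern, replacement in laws:
        word = re.sub(pattern, replacement, word)
    return word


# Hypothetical sound laws, applied in this order.
laws = [
    (r"p(?=[aeiou])", "f"),  # p > f before a vowel
    (r"t$", ""),             # word-final t is lost
]

print(apply_laws("patat", laws))  # fata
print(apply_laws("apta", laws))   # apta (neither law applies)
```

The sketch runs in the forward direction only: the laws must be known in advance, whereas the induction task asks for the laws to be recovered from the ancestor and descendant forms.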
How do current solutions perform?
- problem of handling conditioning context (be it long-distance or abstract)
- no direct solution to the task at hand
Why is the task so difficult?
- problem of handling context of arbitrary distance to target sound
- problem of handling "abstract" context (suprasegmentals)
- problem of handling systemic aspects of sound change (where sound change is modeled in features)
Suggestions for solutions:
- multi-tiered sequence modeling (List 2014, List and Chacon 2015)
- by modeling all different possible conditioning contexts, we make sure that we can find the context that conditions a sound change
- by using computational tools to select those contexts that actually condition a sound change, we can identify and propose potential environments of varying degrees of abstractness
- we still need to reflect, however, on how to handle systemic aspects of sound change
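The idea of modeling several context tiers and then selecting the one that conditions a change can be sketched as follows. This is a strongly simplified illustration, not the method of the cited work: it assumes ancestor and descendant words are aligned one-to-one (equal length), uses only three tiers (preceding sound, following sound, position in the word), and all names and data are hypothetical.

```python
from collections import defaultdict


def tiers(word):
    """Yield one dict of context tiers per position in a sound sequence."""
    last = len(word) - 1
    for i in range(len(word)):
        yield {
            "preceding": word[i - 1] if i > 0 else "#",
            "following": word[i + 1] if i < last else "#",
            "position": "initial" if i == 0 else "final" if i == last else "medial",
        }


def conditioning_tiers(pairs, source):
    """Return the tiers whose values fully predict the reflex of `source`.

    pairs: (ancestor, descendant) sound sequences of equal length,
    assumed to be aligned position by position.
    """
    # tier name -> tier value -> set of reflexes observed in that context
    outcomes = defaultdict(lambda: defaultdict(set))
    for ancestor, descendant in pairs:
        for context, old, new in zip(tiers(ancestor), ancestor, descendant):
            if old == source:
                for tier, value in context.items():
                    outcomes[tier][value].add(new)
    return sorted(
        tier
        for tier, by_value in outcomes.items()
        if all(len(reflexes) == 1 for reflexes in by_value.values())
    )


# Hypothetical data: k becomes tʃ before i and stays k elsewhere.
pairs = [
    (["k", "i", "t"], ["tʃ", "i", "t"]),
    (["k", "a", "t"], ["k", "a", "t"]),
    (["a", "k", "i"], ["a", "tʃ", "i"]),
    (["a", "k", "a"], ["a", "k", "a"]),
]
print(conditioning_tiers(pairs, "k"))  # ['following']
```

Only the "following" tier separates the two reflexes of k cleanly here. More abstract tiers (stress, tone, syllable structure) could be added to `tiers` in the same way, while changes that affect whole classes of sounds would need a feature-based representation that this sketch does not attempt.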