A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets
Workflows in Computer-Assisted Language Comparison are based on an iterative process in which data is constantly passed back and forth between experts and machines. The specifics of a given workflow depend on the task and may vary drastically. In lexical reconstruction, for example, essential steps involve cognate identification, alignment analyses, and the reconstruction of proto-forms. But more specific tasks, like internal reconstruction or the detection of borrowed forms, may also be relevant.
The EDICTOR is a web-based tool that allows users to edit, analyse, and publish etymological data. It is available as a prototype in Version 0.1 and will be further developed in the project "Computer-Assisted Language Comparison" (2017-2021). The tool can be accessed via the website at http://edictor.digling.org, or be downloaded and used offline. All that is needed to use the tool is a web browser (Firefox, Safari, Chrome). Offline usage is currently restricted to Firefox. The tool is file-based: the input is not a database structure, but a plain tab-separated text file (such as a single sheet exported from a spreadsheet editor). The data formats are identical to those used by LingPy, thus allowing for a close interaction between automatic analysis and manual refinement.
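To illustrate what a file-based, tab-separated input can look like, the following minimal sketch reads a small wordlist with Python's standard library. The column names (ID, DOCULECT, CONCEPT, IPA, TOKENS, COGID) follow common LingPy conventions but are an assumption here, and the function read_wordlist is a hypothetical helper, not part of the EDICTOR or LingPy.

```python
import csv
import io

# Toy wordlist in a tab-separated layout. The column names are assumed
# to follow LingPy conventions; real datasets may use other columns.
SAMPLE = (
    "ID\tDOCULECT\tCONCEPT\tIPA\tTOKENS\tCOGID\n"
    "1\tGerman\thand\thant\th a n t\t1\n"
    "2\tEnglish\thand\thænd\th æ n d\t1\n"
    "3\tFrench\thand\tmɛ̃\tm ɛ̃\t2\n"
)


def read_wordlist(handle):
    """Return a dict that maps integer row IDs to row dictionaries."""
    # QUOTE_NONE keeps quotation marks in transcriptions as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    wordlist = {}
    for row in reader:
        row["TOKENS"] = row["TOKENS"].split()  # space-separated segments
        row["COGID"] = int(row["COGID"])       # cognate set identifier
        wordlist[int(row.pop("ID"))] = row
    return wordlist


wordlist = read_wordlist(io.StringIO(SAMPLE))
for idx, row in wordlist.items():
    print(idx, row["DOCULECT"], row["CONCEPT"], row["TOKENS"], row["COGID"])
```

Because the file is plain text, the same data can be opened in a spreadsheet editor, processed by a script, and loaded into the tool without conversion.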
The EDICTOR structure is modular, consisting of different panels that allow for:
File formats are straightforward:
Cognates can be assigned to words using two rudimentary operations:
The Cognates panel is synchronized with the wordlist panel: when assigning cognates within meaning slots, the wordlist panel is automatically filtered to show only the words under consideration.
The Cognates panel also allows users to directly align the words that were assigned to the same cognate set.
Partial cognates can be assigned to words provided that the data is segmented into morphemes. The assignment follows a very intuitive schema:
Annotating morpheme structures is an important first step for language-internal reconstruction, as it allows us to assign the words in our data to word families. The EDICTOR offers a module that automatically searches for potential partial colexifications in morphologically presegmented data, which can then be quickly annotated by an expert.
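The idea of searching for partial colexifications can be sketched as follows: within one language, any morpheme that recurs in words for different concepts is a candidate for a shared word family. This is a simplified illustration under the assumption that morphemes are compared by exact string identity; it is not the EDICTOR's actual implementation, and the function name shared_morphemes is hypothetical.

```python
from collections import defaultdict

# Toy data: (language, concept, list of morphemes). Each morpheme is a
# string of space-separated segments.
WORDS = [
    ("German", "hand", ["h a n t"]),
    ("German", "shoe", ["ʃ uː"]),
    ("German", "glove", ["h a n t", "ʃ uː"]),  # Handschuh = Hand + Schuh
    ("German", "foot", ["f uː s"]),
]


def shared_morphemes(words):
    """Map (language, morpheme) to the concepts whose words contain it.

    Only morphemes that occur in more than one concept are returned,
    since these are the candidates an expert would want to inspect.
    """
    index = defaultdict(set)
    for language, concept, morphemes in words:
        for morpheme in morphemes:
            index[(language, morpheme)].add(concept)
    return {
        key: sorted(concepts)
        for key, concepts in index.items()
        if len(concepts) > 1
    }


candidates = shared_morphemes(WORDS)
for (language, morpheme), concepts in candidates.items():
    print(language, morpheme, concepts)
```

An automatic search of this kind only proposes candidates; whether two identical morphemes really belong to the same word family remains a decision for the expert.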
The core idea of annotation is:
In order to check how well a given transcription was carried out, the Phonology panel can be used. It lists all phonemes of a given language (proper segmentation is required) together with their frequency of occurrence. The expert can thus check the correctness of very rare phonemes or unexpected characters. Since the Phonology panel links to the Wordlist panel, experts can quickly find the words containing specific phonemes and inspect or correct them. In addition, an IPA chart can be displayed to check the structural properties of the sound system of a given language.
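The frequency check described above can be sketched in a few lines. The example assumes that words are already segmented into space-separated sounds; the function names and the threshold max_count are illustrative choices, not EDICTOR settings.

```python
from collections import Counter, defaultdict

# Toy data: (language, segmented transcription).
WORDS = [
    ("German", "h a n t"),
    ("German", "f uː s"),
    ("German", "h a n t ʃ uː"),
    ("English", "h æ n d"),
]


def phoneme_inventory(words):
    """Count how often each segment occurs in each language."""
    counts = defaultdict(Counter)
    for language, tokens in words:
        counts[language].update(tokens.split())
    return counts


def rare_phonemes(counter, max_count=1):
    """Return segments that occur at most max_count times."""
    return sorted(seg for seg, n in counter.items() if n <= max_count)


inventory = phoneme_inventory(WORDS)
for language, counter in inventory.items():
    print(language, counter.most_common())
    print("  rare:", rare_phonemes(counter))
```

Segments at the low end of such a list are the ones most worth a second look, since they may be transcription errors rather than genuine phonemes.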
If alignments are provided for a given dataset, one can use the Correspondences panel of the EDICTOR to compare the frequency of sound correspondences between language pairs. In this way, errors in cognate assignment or alignment analyses can be quickly corrected, and a general idea of regular sound correspondences can be derived. Sound correspondences can also be defined for a given context. The context either needs to be submitted by the user in an additional column, or it can be computed automatically, based on the idea of prosodic strings (List 2014), which assign each sound a value based on its prosodic weight.
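Counting sound correspondences from aligned cognate sets can be sketched as follows. The data layout (cognate set ID mapped to aligned segment lists per language, with "-" as gap symbol) and the function name correspondences are assumptions made for this illustration; context-sensitive counting with prosodic strings is not covered.

```python
from collections import Counter
from itertools import combinations

# Toy alignments: cognate set ID -> {language: aligned segments}.
# All rows of one cognate set have the same length; "-" marks a gap.
ALIGNMENTS = {
    1: {"German": ["h", "a", "n", "t"], "English": ["h", "æ", "n", "d"]},
    2: {"German": ["h", "ʊ", "n", "t"], "English": ["h", "aʊ", "n", "d"]},
}


def correspondences(alignments):
    """Count aligned segment pairs for every pair of languages."""
    counts = {}
    for rows in alignments.values():
        # Sorting the language names gives each pair one stable key.
        for lang_a, lang_b in combinations(sorted(rows), 2):
            pair_counts = counts.setdefault((lang_a, lang_b), Counter())
            for seg_a, seg_b in zip(rows[lang_a], rows[lang_b]):
                if seg_a == "-" and seg_b == "-":
                    continue  # column is empty for this language pair
                pair_counts[(seg_a, seg_b)] += 1
    return counts


counts = correspondences(ALIGNMENTS)
for pair, pair_counts in counts.items():
    print(pair, pair_counts.most_common())
```

Pairs that recur across many cognate sets point to regular correspondences, while isolated pairs are good places to look for errors in cognate assignment or alignment.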
Templates can be used to develop first questionnaires that can then be filled out with the help of the EDICTOR. Template functionality is still rudimentary in the EDICTOR. Users can select among different concept lists (Swadesh, Blust, etc.) and also merge multiple concept lists. More fine-grained operations (the intersection of concept lists, or mergers which take concept similarity into account) are not yet implemented, but are currently being developed for the Concepticon, where they will be available with the next official release (planned for 2017).
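What merging and intersecting concept lists amounts to can be shown with a minimal sketch. The two lists below are invented toy excerpts, not the actual Swadesh or Blust lists, and concepts are compared by exact label, which ignores the concept similarity mentioned above.

```python
# Toy concept lists (invented excerpts for illustration only).
LIST_A = ["hand", "foot", "eye"]
LIST_B = ["eye", "hand", "sun"]


def merge(*concept_lists):
    """Ordered union: keep each concept once, in order of first occurrence."""
    seen = set()
    merged = []
    for concepts in concept_lists:
        for concept in concepts:
            if concept not in seen:
                seen.add(concept)
                merged.append(concept)
    return merged


def intersect(first, *others):
    """Concepts found in every list, in the order of the first list."""
    other_sets = [set(concepts) for concepts in others]
    return [c for c in first if all(c in s for s in other_sets)]


print(merge(LIST_A, LIST_B))
print(intersect(LIST_A, LIST_B))
```

Exact label matching breaks down as soon as two lists gloss the same concept differently, which is why linking the lists to a shared reference catalogue is the more robust route.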
The EDICTOR has a database backend that allows data to be stored automatically on a server. In order to support this, databases need to be explicitly created, and there is no official way to do this at the moment: users who wish to use the database backend need to ask me to set up a database and create passwords for them. The interface for databases in the Customize menu allows users to select those items of a given database that they want to inspect. The selection is stored in a link that users can then bookmark and use whenever they want to work on the data.
Users can customize a large part of the EDICTOR's default settings. This is done with the help of specific URLs that users can bookmark to call the tool in their preferred view. For example, if one wants to see the alignments immediately when loading a file, this can be specified. If users prefer to use their own IPA keyboard instead of SAMPA conversion, this is also possible. More possibilities for customization will be added in the future, but the EDICTOR already offers a great deal of flexibility.
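The general mechanism of a bookmarkable, customized URL can be sketched as follows. Only the base address is taken from the text; the parameter names file and view and their values are hypothetical placeholders chosen for this illustration and are not documented EDICTOR settings.

```python
from urllib.parse import urlencode

BASE_URL = "http://edictor.digling.org/"


def build_url(base, **settings):
    """Append the given settings to the base address as a query string."""
    return base + "?" + urlencode(settings)


# NOTE: "file" and "view" are hypothetical parameter names; consult the
# tool itself for the settings it actually understands.
url = build_url(BASE_URL, file="mydata.tsv", view="alignments")
print(url)
```

Since all settings live in the address itself, a link of this kind can be bookmarked or shared and will restore the same configuration each time it is opened.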
The EDICTOR can also be used as a convenient interface to publish data on the web. As it is purely text-based, all that is needed is to clone the EDICTOR software and host it on a server of one's choice. Then, by using customized URLs, one can present a given dataset in read-only mode. In this way, users cannot edit the data, but they can use the interactive features to inspect it.
Thanks for Your Attention!