Considering language varieties and language contact in Natural Language Processing and Machine Translation: the case of Guarani
Abstract: Despite their diversity, the Indigenous languages of the American continent have received little attention from a technological perspective (Mager, Gutierrez-Vasques, Sierra & Meza-Ruiz, 2018). Guarani has a large speaker population, and in terms of vitality it has developed to the point that it is used and sustained by institutions beyond the home and community (Ethnologue 2022). However, its presence on the web is scarce, even on websites from Paraguay, a bilingual country in which both Spanish and Guarani are predominant languages. This is a challenge in itself. In addition, there are several varieties of Guarani, mostly as a result of its contact with Spanish. The longstanding discussion over the boundaries between dialects and languages poses a major challenge for Natural Language Processing, and Guarani is a good example of this problem, especially for Machine Translation. Together with Spanish, Guarani is an official language of Paraguay, and it is also widely spoken by the country's non-indigenous population (Estigarribia, 2015). Its coexistence with Spanish resulted in the emergence of new varieties and language mixing, which can be traced back to colonial times in the Jesuits' notes, e.g. Dobrizhoffer (1783). Guarani has only recently adopted a single stable orthography and has a limited online presence, among other characteristics that make it difficult to work with from a computational viewpoint (lack of digital resources for language processing, bilingual electronic dictionaries, transcribed speech data, etc.). The Guarani-Spanish language pair has been in contact for centuries, generating several contact varieties. We propose a discussion of the many challenges faced while building the corpus, which stem from the scarcity of bilingual literature and its format, as well as a discussion of the distinction between the many varieties of Guarani spoken in Paraguay and its mixing with Spanish.
The interdisciplinary spirit of this project, which joins forces from engineering and linguistics, is also a novelty for the field, especially in South American academia.
Yliana Rodríguez and Luis Chiruzzo
SPEAKER & TOPIC
DATE AND PRESENTATION
Measuring the impact of subtitles on cognitive load