The curation, distribution and maintenance of reusable digital text and speech resources for South Africa’s official languages are of vital concern for research and development in the field of language technology. The data is important not only for developing tools that facilitate communication between different language groups, but also for empowering local languages for use in modern communication systems. The South African Centre for Digital Language Resources (SADiLaR) has taken on this crucial guardian role through its Language Resource Repository. To date, it contains hundreds of items in multiple languages, which are available to the public through an open-access platform.
“SADiLaR’s Language Resource Repository has over 400 records of items in multiple languages, even a few languages from outside South Africa,” says Dr Friedel Wolff, SADiLaR’s Technical Manager. “Some of the items themselves describe a resource that is itself multilingual or, for example, software that supports several languages. Not every resource in your language might interest you, but it might just be what some researcher or software engineer needs to build something exciting for your language.”
Giving permanence to resources
The various types of available resources range from electronic text and speech data (such as domain-specific text collections, wordlists, dictionaries, translation memories and aligned multilingual corpora) to multimodal resources and tools, applications and platforms that support the processing of data and development of new technologies.
According to Wolff, the research data stored in SADiLaR’s repository is of immeasurable value to researchers. “Much of the research data on the repository was costly and time-consuming to create. Some required expert knowledge or computing power that few of us have access to,” he comments. “The repository makes these available to anyone who is interested, and the idea with repositories like these is that the repository should outlive any specific research topic, researcher’s interest or industry fad – in other words, it tries to give some permanence to these resources. Providing this permanence is maybe too hard and tedious for many of the creators, and not always easy to justify in their place of employment. This provides a centralised access point, without trying to take away any of the credit to the people who put the work into creating them,” he explains.
Central point of access
Dr Benito Trollip, a digital humanities researcher at SADiLaR and an enthusiastic user of and contributor to the repository, echoes the above. “The SADiLaR Language Resource Repository provides a (in principle) permanent platform for the availing of linguistic data to the broader community (that includes not only researchers). It takes one curious person to see what is out there for less well-known languages and they start developing useful technology,” says Trollip.
When it comes to the repository being a central point of access, Trollip emphasises how difficult it can be to utilise an existing linguistic data source if the data itself or, where the data is of a sensitive nature, information about it is not made available.
“It often took a lot of time and hard work to generate and curate that data. In my humble opinion, we should move away from the mindset of owning, developing and using data solely for our own gain or professional and financial benefit, and rush toward a mindset of sharing data to enable and empower the community at large,” he says.
Integral tool
Dr Laurette Marais, manager of SADiLaR’s speech node at the Council for Scientific and Industrial Research (CSIR), and her team have experienced the advantages of SADiLaR’s repository as both contributors and users: they shared their valuable resources with others, which enabled the development of commercial products, and they also benefited by accessing resources that they did not create themselves.
“For the CSIR Voice Computing research group, also known as the Speech Node of SADiLaR, the Resource Repository has become an integral tool in the planning and execution of our research agenda, both as a reliable venue for sharing the data that we gather and produce, but also as a first port of call when we require language resources for our projects. A notable contribution of ours to the repository was high-quality speech data from our Lwazi 3 project, which we have also used to develop our commercial suite of TTS voices, named Qfrency,” says Marais.
“We have in the past and still are contributing speech data aimed at training automatic speech recognition (ASR) systems. Furthermore, the repository has served as an essential source when we require text data in any of the South African languages. I believe that any student or researcher in language technology in South Africa should be familiar with the repository and what it has to offer, especially given the resource-scarce nature of our languages.”
A short history
Interestingly, the repository predates SADiLaR. It was launched in 2012 by the North-West University’s Centre for Text Technology as the Resource Management Agency (RMA), with funding from the Department of Arts and Culture’s National Centre for Human Language Technologies. When SADiLaR was launched in 2019 with the support of the Department of Science and Innovation (following an incubation and development phase since 2016), the RMA was incorporated into SADiLaR’s Language Resource Repository. SADiLaR took over full responsibility for the curation and maintenance of the repository thereafter.
Submit a resource
If you have developed a language resource and wish to make it usable and/or discoverable by others, SADiLaR’s repository is an excellent option. It is a secure environment with the correct licensing procedures for anyone with research data in the fields of languages, humanities and social sciences. For more information on how to submit a resource, please visit the SADiLaR Resource Guidelines page.
(Written by Birgit Ottermann)
