Repositories pivotal for language preservation

Two researchers from the South African Centre for Digital Language Resources (SADiLaR) share their findings on the existence, use and importance of language repositories in the latest issue of the Journal of the Digital Humanities Association of Southern Africa (DHASA), a peer-reviewed, open-access journal.

In their article ‘Resource Repositories and linking resources: An exploratory study’, Dr Benito Trollip and Ms Mmasibidi Setaka compare the different types of repositories available in South Africa and highlight their pivotal role in the development, preservation and advancement of languages.

“Language resources have proved to be essential tools for research and development,” says Setaka. “They allow the acquisition, preparation, collection, management, and customisation of different types of datasets, and are inclusive of spoken and written work, computational tools, lexicographic resources and terminology databases,” she explains.

For their study, the researchers focused on two types of language repositories: Institutional Repositories and Language Resource Repositories.

Ms Mmasibidi Setaka: SADiLaR’s Digital Humanities Researcher

Dr Benito Trollip: SADiLaR’s Digital Humanities Researcher

Institutional repositories

Institutional repositories are sprouting all over the world as a result of the availability of scholarly resources in digital formats, and in response to open-access laws and regulations.

According to Trollip, researchers are increasingly realising the importance of making their research findable through an online storage facility.

“This aligns closely with the guiding principles of FAIR (Findable, Accessible, Interoperable and Reusable) and CARE (Collective benefit, Authority to Control, Responsibility, and Ethics) data that emphasise not only the need for effective stewardship and accessibility to all, but also the obligation to uphold ethical standards towards the people the data has been acquired from,” he comments.

While many of South Africa's universities have institutional repositories, most host only the research outputs typically associated with universities, e.g. academic articles, theses or dissertations. African indigenous languages are often poorly represented in these repositories, with some universities having no African language content in their repositories at all. Additionally, when compared to the other BRICS countries, South Africa ranks lowest, with the fewest open-access repositories. This warrants more research into institutional repositories, Setaka and Trollip note.

Language resource repositories

The fact that information is available online does not guarantee that researchers or other interested persons will use it, especially if they are not aware that it exists. This is where language resource repositories come in: a host site that links available resources, together with a repository where resources can be uploaded, makes research findable and accessible while also preserving it.

Setaka and Trollip cite SADiLaR's repository as a prime example. This repository contains a range of datasets and applications that are downloadable or, at the very least, findable via relevant metadata or the contact information of the responsible people or organisations. The focus of this repository is on South African languages and the tools and resources developed for them. As of 29 August 2022, the repository contained a total of 406 assets.

Closely related to the type of repository SADiLaR hosts is Lanfrica, a linking site that acts as a search engine for African language resources. Importantly, Lanfrica’s inventory includes links to a broad range of research outputs, encompassing the kinds of resources found in institutional repositories, sources in popular media, and resources found in language resource repositories. Because Lanfrica is a linking site, resources are not available for download on its website; instead, links to those resources are provided. Lanfrica’s scope is much broader than that of SADiLaR insofar as all African languages and their resources are relevant.

A great benefit of language resource repositories is that they provide greater visibility to low-resource languages, such as African languages, which often lack digital representation and scholarship. A challenge, however, is a potential lack of awareness and visibility, often as a result of incomplete metadata. Incorrect metadata could prevent researchers from finding resources, even if they are aware of them. Such repositories also need the buy-in and participation of the research community.

In their conclusion, Setaka and Trollip suggest further study of data-sharing incentives and, more specifically with reference to the South African context, of what is keeping institutes or researchers from sharing their research data.


Read the article here:

(Written by Birgit Ottermann)