Establishing a new infrastructure

English can be regarded as the de facto language of the digital age due to the rapid development of the internet and communication technology over the last three decades. However, speakers of other languages have taken up the challenge and there has been a concerted effort to develop digital tools and other resources in languages other than English to support people’s need to use their languages on digital platforms.

The South African Centre for Digital Language Resources (SADiLaR) has been established to foster the growth of digital research and development in the official languages of South Africa. SADiLaR forms part of the Department of Science and Technology’s (DST) new South African Research Infrastructure Roadmap (SARIR) for the large-scale development of research capacity in South Africa. SADiLaR is the only programme in SARIR that focuses on the humanities, with the remaining research infrastructure projects focusing on various aspects of health and natural sciences.

BACKGROUND

South Africa has joined the international playing field with the DST’s establishment of SARIR as part of their long-term research and development plan. SARIR is intended to provide a strategic, rational, medium- to long-term framework for planning, implementing, monitoring, and evaluating the provision of research infrastructures necessary for a competitive and sustainable national system of innovation. Thirteen research infrastructures from five scientific domains have emerged as concrete and sufficiently conceptualised proposals for inclusion in SARIR.

In 2008, a ministerial advisory committee addressed recommendations to support human language technology (HLT) development at a national level. SADiLaR resulted from this decision and from the work of various researchers under the guidance of former Director, Prof. Justus Roux. It provides a vision of developing and supporting a multilingual democracy through access to digital language resources and language technology development. Ten years after the initial decision, SADiLaR is in its first year of incubation as one of SARIR’s research infrastructures. It facilitates an environment for the creation, management, and distribution of digital language resources by offering language data and applicable software for the 11 official South African languages, freely available for research and development purposes.

The North-West University (NWU) hosts this multi-partner entity, which has a network of linked nodes consisting of a number of South African universities and agencies (UP, UNISA, CSIR, ICELDA, and CTexT). SADiLaR is the first of its kind in Africa and promotes existing links with similar entities globally, especially with a major counterpart in Europe, the Common Language Resources and Technology Infrastructure (CLARIN).

SCOPE OF SADiLaR

Internationally, well-resourced languages have corpora of more than a billion words per language, or thousands of hours of digital speech data, giving the users of these languages access to functional language-based technologies such as automatic speech recognition systems and machine translation applications. Unfortunately, the situation for most of the official languages of South Africa is significantly different, with only relatively small data sets available in both the text and speech domains. This, in turn, limits the opportunities for developing functional technologies to support the multilingual South African community. The establishment of SADiLaR will enable the future research and development of language technologies in South Africa, as SADiLaR aims to pool computational resources for these purposes.

Beyond the development of technologies that language resources enable, these resources are also a prerequisite for the study of language through digital means, typically encapsulated in the field of digital humanities.

“South Africa is twenty years behind the rest of the world when it comes to the development of Digital Humanities [DH], but SADiLaR ought to enable researchers to develop skills to do DH-related research on par with what is produced internationally,” says Prof. Attie de Lange, current Director of SADiLaR.

Furthermore, SADiLaR will, through capacity-building initiatives, promote and support the use of digital data and innovative methodological approaches aiding numerous projects in the domains of Humanities and Social Sciences.

As an enabling agency, SADiLaR will provide training for a new generation of researchers through various workshops presented by both national and international experts. Workshops will also be held on request and will cover a broad spectrum of topics in the DH and HLT domains, such as digitisation, data standards, the use of computationally mediated approaches through tools such as R and Python, and methods to clean, organise, and analyse data, as well as thematic workshops relating to the Humanities and Social Sciences.