Using the infrastructure

One of the central aims of SADiLaR is to provide a sustainable, widely available set of language resources that supports research in various domains where digital collections of language data are required. With this aim in mind, SADiLaR offers a variety of services that support the curation, development, distribution, and maintenance of language resources. SADiLaR provides easy and sustainable access to digital language data as well as language processing and analysis tools.

SADiLaR and its partner nodes also offer various support activities that help researchers and developers create and distribute language resources beyond the curated resources developed by SADiLaR and its nodes.

As the scope of SADiLaR broadens over the lifetime of the project, more services will be made available via this website.

Persistent identifiers

A facility for creating and registering electronic resources so that each one can be referenced through a unique identifier rather than through a URL, which is unstable. This ensures long-term access to the resources.
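The idea of referencing a resource by a stable identifier instead of a URL can be illustrated with a small sketch. This is not SADiLaR's actual system: the class `IdentifierRegistry`, its methods, the identifier `example/0001`, and the URLs are all hypothetical and exist only to show the indirection.

```python
class IdentifierRegistry:
    """Toy registry (hypothetical): maps a stable identifier to a resource's current location."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        # An identifier is assigned once and never reused for another resource.
        if identifier in self._locations:
            raise ValueError(f"identifier already registered: {identifier}")
        self._locations[identifier] = url

    def relocate(self, identifier, new_url):
        # The resource moves, but the identifier that people cite stays the same.
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        # Look up where the resource currently lives.
        return self._locations[identifier]


registry = IdentifierRegistry()
registry.register("example/0001", "https://old.example.org/corpus.zip")  # made-up values
registry.relocate("example/0001", "https://new.example.org/corpus.zip")
current_location = registry.resolve("example/0001")
```

A citation that uses `example/0001` keeps working after the move, whereas a citation of the old URL would break.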

Language data and technology repository

A digital index of language resources that are available for South African languages from various research and private institutions, both nationally and internationally. All Language Resource Index items contain metadata, including developer details, specifications, and contact information.

Researchers who have data sets available can register their resources, digital or otherwise, with the index through the Resource procedures.

A digital collection of language resources, in various modalities, that are available for download from SADiLaR.

Data providers who would like to distribute their resources on the SADiLaR site can upload metadata and resources via the repository. All resources are reviewed by SADiLaR before being made available.

Language data search environment

  • A service for online searches of the corpora available from SADiLaR, with the following functionality:
    • Keyword-in-context (KWIC) searches;
    • Word and frequency list generation;
    • Filtering of data based on metadata annotations;
    • Part-of-speech and lemma-based searches.
  • African Wordnets, a hierarchical ontology of words and concepts with synonym and antonym sets, including usage examples.
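To give a feel for two of the search functions listed above, keyword-in-context searches and frequency list generation, here is a minimal sketch. It does not reflect how the SADiLaR search environment is implemented: the functions `tokenise`, `kwic`, and `frequency_list` are hypothetical, the tokeniser is a naive regular expression, and the sample sentence is invented.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it into word tokens (naive stand-in for a real tokeniser)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


sample = "The cat sat on the mat. The dog sat near the cat."  # invented example text
tokens = tokenise(sample)

for left, word, right in kwic(tokens, "sat"):
    print(f"{left:>10} [{word}] {right}")

print(frequency_list(tokens)[:3])
```

Each hit is shown with a fixed window of words on either side, which is the essence of a concordance view; the frequency list is simply a count of every token in the same data.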

Core technologies

Online versions of so-called core technologies that support the automatic processing of language data for the South African languages, including:

  • tokenisers;
  • part-of-speech taggers;
  • named-entity recognisers;
  • language identifiers; and
  • phrase chunkers.