Integrating the RMA into SADiLaR with new technologies

Over the past five years, the Language Resource Management Agency (RMA) has been the central repository for the distribution and management of language resources, data and software tools for the official languages of South Africa. The RMA has given SADiLaR an excellent foundation to build on. With the knowledge and skills gained from the RMA project, we are now ready to advance to a new phase and integrate the RMA into an international platform that will have not only national but also global impact.

SADiLaR provides a platform for accessing and reusing linguistic data, while also offering researchers technologies and software that simplify linguistic analysis. The main distribution channel will be a repository through which interested parties can access any of the language resources distributed by SADiLaR. “The repository will also link to larger international infrastructures and language distribution agencies, such as the European Language Resources Association (ELRA) and CLARIN in Europe, and the Linguistic Data Consortium (LDC) in the USA,” says Dr Roald Eiselen, SADiLaR’s technical manager.

The move to an institutional repository for the distribution of language resources was made primarily for the following four reasons:

  1. to simplify the access and download procedures for users by moving away from the “shopping cart” experience;
  2. to provide all resources with a digital object identifier (DOI), which is integrated into the international digital handle system;
  3. to allow easy integration of the data resources into other repositories and data infrastructures, such as CLARIN and LDC; and
  4. to take a first step towards obtaining the “Data Seal of Approval” for SADiLaR, which will give the repository a more solid standing in the data distribution community.

SADiLaR will also make available several research-enabling technologies, such as:

  • metadata and data processing infrastructures that are specifically linked to particular projects;
  • general language data analytic platforms made available online; and
  • automatic language analysis modules that support the development of more complex language technologies.

Although a substantial number of open-source technologies are reused and adapted to the South African context, several of the technologies and services under development are new and will be distributed for further use by language communities both in Africa and around the world.

Over the coming year, SADiLaR will continue to expand the set of available language resources and extend the set of automatic analysis tools available via web interfaces. These technologies are expected to make it easier for end users to analyse their own linguistic data, or to search and analyse the data available from SADiLaR.