SADiLaR Projects

African Wordnet and Multilingual Linguistic Terminology

The African Wordnet (AfWN) and Multilingual Linguistic Terminology project is two-pronged and concerns the development of language resources in the form of wordnets for the South African languages as well as the development of linguistics terminology for nine South African languages. 

Project Start Date: 1 October 2017
Project Status: Phase 1 & 2 complete; Phase 3 in progress



Communicative Development Inventories for all South Africa’s eleven official languages

The aim of this project is to collect and digitize data on children’s language development from 8 to 30 months and, from these data, to construct and validate Communicative Development Inventories (CDIs). CDIs are parent-completed questionnaires (for infants 8-18 months and toddlers 16-30 months) about children’s vocabulary, gesture and grammatical abilities. The inventories will cover all official South African languages: Setswana, Sesotho, isiXhosa, Xitsonga, Afrikaans, Sesotho sa Leboa, Tshivenda, isiNdebele, Siswati, isiZulu and South African English.

Project Start Date: 1 January 2018
Project Status: Phase 1 complete; Phase 2 in progress



Corpus and system development for automatic captioning of official speeches

The primary aim of the proposed project is to create a corpus of automatically transcribed government speeches. The CSIR proposes to start with the current president (Mr Cyril Ramaphosa) and then expand the corpus with speeches made by previous presidents and/or other members of parliament. A secondary aim is to initiate the development of an automatic speech recognition system that could serve as a first step towards addressing the need for automatic captioning expressed by GCIS.

Project Start Date: 1 April 2020 
Project Status: Project in progress



Development of a multi-level, multi-genre learner corpus of academic writing

Development of a multi-genre, multi-level learner corpus of academic writing in order to develop, refine and implement an online academic writing tool.

Project Start Date: 1 March 2017
Project Status: Project completed



Digitisation of Language resources

Building language resources for the indigenous South African languages through digitisation of language and language-related text, audio, online and video data. This project entails the continuation of mass digitisation for all 11 official languages of South Africa. Digitisation will also include digital resources for specific needs and projects.

Project Start Date: 1 April 2017
Project Status: Project ongoing



Enabling localised language technology applications: A Computational Wide coverage resource grammar for isiZulu

The CSIR node of SADiLaR recently completed a project whose main aim was to deliver to the research community a high-quality, computational, wide-coverage resource grammar (WCRG) for isiZulu. WCRGs unlock opportunities for the South African languages to participate in multilingual research, nationally and internationally.

Project Start Date: 1 April 2021
Project Status: Project completed



Escalator

This programme will ensure the sustainability of the community by contributing to the development of leaders at the public universities and research centres. Through the interventions, both champions and other community members will build skills and confidence in using digital tools and methodologies in their own research and teaching. The programme will align closely with other institutional, regional, national and international initiatives in digital capacity, community development and infrastructure.

Project Start Date: 1 December 2020
Project Status: Project in progress



Exploring fair and unbiased testing

Creation of a protocol for fair and unbiased testing.

Project Start Date: 1 March 2017
Project Status: Project completed


Harvesting existing sources of speech data for HLT development in South Africa

The aim of the project is to explore different possibilities for the (semi-)automatic harvesting of existing sources of speech data to create resources that can be used to develop new speech technologies and improve existing ones. Ultimately, the aim of the project is to enlarge the existing speech corpora for all South Africa’s official languages. This will entail the collection of appropriate speech and text data for L1 to L6, enabling the development of baseline ASR systems, followed by the development and release of automatically transcribed speech data and updated harvesting procedures for the remaining languages (L7 to L11).

Project Start Date: 1 April 2018
Project Status: Project completed



Health Resources in the South African Languages

A systematic review of the health resources available for the South African languages, culminating in an index of health resources. A wide range of resources form part of the index, including screening questionnaires, diagnostic assessments, and intervention programmes designed for health professionals.

Project Start Date: 1 November 2018
Project Status: Project completed



Human Language Technologies Audit 2017/2018

This project aims to provide information on the current state of HLT R&D in South Africa. Specifically, it aims to replicate the HLT audit completed in 2009 and to update the information on the various HLT tools, resources and applications identified in that audit. The tools, resources and applications developed since 2009 will be identified and categorised using an updated version of the technology matrix previously employed.

Project Start Date: 1 July 2017
Project Status: Project completed



Linguistic corpus enrichment for conjunctively written South African languages

This project, developed under the Nodes Specialisation Project, makes linguistically enriched corpora available for the four official South African languages with a conjunctive orthography, i.e. isiNdebele, isiXhosa, isiZulu, and Siswati.

Project Start Date: 1 October 2017
Project Status: Project completed



Mobile Dictionary application framework

The project aims to develop an open-source hybrid mobile application framework that will allow online access to a TMS and dictionary API, managed through a TMS API manager (TAM), and offline access to local dictionary content. The framework will create a shared codebase supporting the deployment of both Android and iOS apps through their respective marketplaces. It will expand access to dictionaries by giving users not only online access but also the option to store dictionary content in a local database on their mobile devices.

Project Start Date: 1 August 2020
Project Status: Project in progress 



Multimedia Digital Corpus of siPhûthî

A multimodal digital corpus of siPhûthî as spoken in South Africa and Lesotho.

The compilation of a multimodal corpus of siPhûthî recordings containing narratives, conversations, interviews, folktales, oral histories and poems is a central feature of the project. The audio and video recordings are transcribed, translated, and annotated. The corpus covers a wide range of topics and includes recordings from a large number of speakers from different generations and geographic locations. This corpus is due to be completed in 2024 and will be made available in the SADiLaR repository.

The siPhûthî corpus is being developed to serve as a resource for community members as well as academics from various disciplines. The corpus provides insights not only for linguists, but also for historians, geographers, and cultural anthropologists. Most importantly, it also serves as a cultural and historical memory for community members.

Project Start Date: 1 August 2019
Project Status: Project in progress



Parallel corpora for English-isiXhosa and English-Siswati

This project entails the collection and processing of bilingual data for the development of English–isiXhosa and English–Siswati corpora.

Project Start Date: 1 July 2019
Project Status: Project completed 



Spoken data corpus for Afrikaans, Setswana, Sesotho sa Leboa

The phonetics and phonology of Coloured Afrikaans have as yet barely received any serious attention. This is largely due to the lack of adequate spoken data corpora. Without such corpora, no complete and reliable acoustic descriptions are possible, and satisfactory sociolinguistic studies are also unlikely. The main aim of this project is to fill this gap. The first phase of the project will focus on Coloured Afrikaans. Subsequent projects are planned for Setswana and Sesotho sa Leboa.

Project Start Date: 1 January 2020
Project Status: Project completed



Through the lens Ex Machina: using NLP and statistical learning methods to model eyewitness statements and choosing behavior

The primary aim of this project is to develop and trial a new way of analysing and using eyewitness statements and descriptions to predict eyewitness identification performance. This has not been done before with natural language processing or machine learning methods, and the approach could solve the current difficulty of analysing large quantities of verbal data.

Project Start Date: 1 November 2018
Project Status: Project completed



Towards multilingual academic literacy testing for Secondary and Higher Education

Develop a translation protocol for academic literacy tests (this protocol will also consider the possibility of bias in translations), and translate and refine academic literacy tests for the following languages: English, Afrikaans, isiXhosa, isiZulu, Setswana and Sesotho.

Project Start Date: 1 January 2020
Project Status: Project in progress



Tracing History Trust: VC Daghregister transcription project

The project focuses exclusively on the digitisation and transcription of a number of VOC Journals (VC series) of the Cape of Good Hope, vested in the Western Cape Archives and Record Services, Cape Town. Its main purpose is to make available in the public domain the historical linguistic material (in particular Afrikaans and Dutch) contained in documents of the Dutch East-India Company (VOC), which offer numerous examples of diachronic and synchronic importance. Of the entire period (1651-1795), the years 1671 to 1679 will be completed during this phase of the project.

Project Start Date: 1 October 2018
Project Status: Project completed
