Projects
Autshumato 6
Maintenance on the following Autshumato software systems:
The Integrated Translation Environment (ITE), the Terminology Management System (TMS) and the MT Web service (MTWS). Maintenance on the existing Autshumato MT systems (English into Afrikaans, isiZulu, Sepedi, Setswana, Sesotho, and Xitsonga), including processing of additional data acquired since the previous projects, and retraining and optimising the systems with the newest methods; and research on, and development of, MT systems for automatic translation from Afrikaans, isiNdebele, isiZulu, Sepedi, Setswana, Sesotho, Tshivenḓa and Xitsonga into English.
Communicative Development Inventories for South African Languages: Phase Two
Southern African Communicative Development Inventories for six languages
The addendum seeks to gather data on the language development of 1200 children aged 8-30 months, using communicative development inventories for six South African languages: isiZulu, isiNdebele, Sesotho sa Leboa, South African English, Siswati and Tshivenḓa. The data will be digitally stored at SADiLaR and made freely available to scientists and others involved in language.
Online and open-access Southern African Encyclopedia of Music and Sound.
The aim of the study is to lay the groundwork for the long-term facilitation and compilation (effectively an open-ended process) of an online and open-access Southern African Encyclopedia of Music and Sound.
Enabling Localised Language Technology Applications: A Computational Wide Coverage Resource Grammar for isiZulu
The aim of this project is to develop a Computational Wide Coverage Resource Grammar (WCRG) for isiZulu and to make it available to the research community in a variety of ways.
African Wordnet and Multilingual Literary Terminology Development
The African Wordnet (AfWN) and Multilingual Literary Terminology Development project concerns the development of language resources in the form of wordnets and a literary term bank for various African languages.
Corpora of spoken language
Corpora of spoken language for Sesotho, Setswana and Sepedi
Parallel corpora for English into isiXhosa
Development of parallel data sets between English and isiXhosa.
Collaboration with The Carpentries
A collaboration with The Carpentries to teach foundational coding and data science skills to researchers in South Africa.
Audit on Language Resources at Higher Education Institutions (HEIs)
The aim of the project was to carry out an analysis of all the language resources and policies currently in use at public higher education institutions in South Africa. This was done by conducting an audit at the HEIs and reporting on the findings.
Extended digitisation of language resources
Building language resources for the indigenous South African languages through digitisation of language and language-related text, audio, online and video data. Digitisation will also include digital resources for specific needs and projects.
Multilingualism in the Online Writing Centre
This project aims to identify strategies that can be used to enhance multilingualism and/or interaction during writing centre consultations (face-to-face and online) that are conducted in a student’s L2 (English).
Word Embeddings Analytics for South African Media analysis: Toward the construction of a Polarization Barometer
The aim of the project is to develop algorithmic and mathematical tools that extract representations of individuals’ opinions from embeddings trained on SA Media data, and to use these tools to develop a news-based indicator of the polarization of South African society. Our contributions will help researchers understand, predict and possibly influence the mechanisms behind the dynamics of polarization in South Africa.
Word Embeddings for South African Languages
The aim of the project is to create word embeddings for three South African languages using the limited data that is available for these languages. The word embeddings will leverage existing resources to create new resources that can be used in the development of various core technologies to the benefit of the South African languages.
Corpora of spoken language for Sesotho, Setswana and Sepedi
The primary aim of the project is to create a publicly available (Open Access) set of multi-language, multi-variety corpora for Sesotho, Setswana and Sepedi.
Enabling Localised Language Technology Applications Phase II: Developing Nguni Computational Grammars and Resources
The purpose of this project is to use the existing isiZulu Grammatical Framework (GF) resource grammar not only to extend computational resources for isiZulu, but also to bootstrap two new resource grammars within the Nguni language family, namely isiXhosa and Siswati, thus enabling the development of similar computational resources for these two new languages.
Multilingual parallel subtitling and dubbing as part of discipline-specific literacy in faculties at two South African universities
Subtitle and dub a selection of English discipline-specific (animal anatomy and physiology) videos in four languages (Sepedi, isiZulu, isiXhosa and Afrikaans).
Linguistic corpus enrichment for South African languages
In this project, we aim to convert and extend the enriched corpora for the four official South African languages with a conjunctive orthography, i.e. isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS), as well as for the five disjunctively written languages, i.e. Sesotho sa Leboa (NSO), Sesotho (ST), Setswana (TN), Tshivenda (VE), and Xitsonga (TS). The project consists of two phases, viz. the conversion of existing NCHLT annotated data and the extension of part-of-speech annotated corpora with data from different text types.
Data Harvesting Phase 3: Harvesting existing sources of speech data for HLT development in South Africa
The proposed objective for Phase 3 of the project therefore updates the Phase 2 objectives, building on the refined transcription techniques of the Phase 2 extension. The Phase 3 objective includes limited collection of spoken audio (existing datasets and new sources, adding to the inventory as this can be accommodated) for these four languages; developing corpora of transcribed speech data from the existing datasets; processing the data with applicable tools; and data curation, packaging and release of the data to SADiLaR, which will extend the project to include 10 official languages of South Africa for speech technology development.
USAf National Language Resources Audit 2023
The aim of this project is to carry out an analysis of all the language resources and policies currently in use at public Higher Education Institutions in South Africa. This will be done by conducting an audit at the HEIs and reporting on the findings.
ǂKhomani San | Hugh Brody Collection
Language verification, orthography standardisation, and development of metadata for all photographs in the collection. This includes standardising, throughout the collection, traditional names transcribed in N|uu, Kora, or Khoekhoegowab: where a photograph has a description of places or names of people, these will be transcribed in the original source language using modern orthographic practices. Standardised Unicode will be used throughout the collection to make the content uniform and digitally findable.
Corpus and System Development for Automatic Captioning of Government Speeches.
The primary aim of the proposed project is to create a corpus of automatically transcribed government speeches. The CSIR proposes to start with the current president (Mr Cyril Ramaphosa) and then expand the corpus with speeches made by previous presidents and/or other members of parliament. A secondary aim is to initiate the development of an automatic speech recognition system that could serve as a first step towards addressing the need for automatic captioning expressed by GCIS.
Digitisation of Reel-to-Reel Cassette Tapes from the Hidden Years Archive
This project aims to digitise a selection of audio-visual material from the Hidden Years Music Archive hosted at the Africa Open Institute at Stellenbosch University and to make it available as Open Access content on the SADiLaR repository.
Digitisation of language resources
Building language resources for the Indigenous South African languages through digitisation of language and language-related data
Digital dictionary collaboration project with African Tongue
Digital Dictionary Resources for N|uu: The last living South African San Language.
Spelling checkers for 10 South African languages
This project aims to develop and make spelling checkers available for 10 South African languages.
Linguistic corpus enrichment for conjunctively written South African languages.
This project aims to develop enriched corpora for the four official South African languages with a conjunctive orthography, i.e. isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS). The corpus for each language will consist of approximately 50,000 tokens, parallel at sentence level, with English as the source language.
Pictorial language representation for Sepedi: Language access for persons with severe communication disabilities
The aim of the project is to develop a research-informed picture-based vocabulary package for Sepedi/Sesotho sa Leboa. This vocabulary package will be designed so that it can be incorporated into an electronic and paper-based AAC system that allows for a measure of novel utterance generation.
Communicative Development Inventories for all South Africa’s eleven official languages – Phase Two
Communicative development inventories for Sesotho sa Leboa, Tshivenḓa, isiNdebele, Siswati, isiZulu and South African English
Data Harvesting Extension: Phase 2, Harvesting existing sources of speech data for HLT development in South Africa
The objective of Phase 2 of the project was to identify potential sources of spoken audio (existing datasets and new) for two additional languages per year; design and implement data capturing tools and procedures for these languages, or consider existing tools for these four languages; and curate, package and release them to SADiLaR for speech technology development in South Africa.
Phase 2: Harvesting existing sources of speech data for HLT development in South Africa
The aim of the project is to develop speech resources in a (semi-)automatic manner (for L3, L4, L5 and L6) based on the Phase 1 feasibility study. This will entail the collection of appropriate speech and text data for L3, L4, L5 and L6, enabling the development of baseline ASR systems, followed by the development and release of automatically transcribed speech data and updated harvesting procedures for the remaining languages (L7 to L11). All data collected through the harvesting platform will be released under the least restrictive license possible, given the origin of the data, with CC BY 4.0 International as the default and CC BY-NC 4.0 International as the most restrictive license.
Multilingual parallel subtitling and dubbing as part of discipline-specific literacy in faculties at two South African universities
Subtitle and dub a selection of English discipline-specific (animal anatomy and physiology) videos in four languages (Sepedi, isiZulu, isiXhosa and Afrikaans)
Compiling a Child Speech Database
Compiling a child speech database for the South African context: speech samples of typically developing Afrikaans- and Sesotho sa Leboa-speaking children.
Homicide Media Tracker and Database
The aim of this project is to create a functional, methodologically sound, user-friendly database, tool, and platform that will enable deep, rich, and broad research into media-based content, specifically news reports of fatal violence, together with the population of an actual homicide news database, which will facilitate the open sharing of this data with students and researchers.
Writing Centre Project
Multilingualism in the Online Writing Centre
VOC Daghregister transcription project: Phase 2
Digitisation and transcription of VOC journals of the Cape of Good Hope, vested in the Western Cape Archives and Records Service in Cape Town.
Dictionary portal for South African languages
Generation of clear protocols relating to the digitisation project that will enable the building of language resources for the official South African languages through the digitisation of relevant analogue text, graphic, audio and video data.
Parallel corpora for English-Siswati
Collect and process bilingual data for the development of an English-Siswati parallel corpus with a size of 1,650,000 words.
Digitisation of language resources
Generation of clear protocols relating to the digitisation project that will enable the building of language resources for the official South African languages through the digitisation of relevant analogue text, graphic, audio and video data.
Expansion and further refinement of a multi-level, multi-genre learner corpus of academic writing
Collect data from various universities in South Africa to grow the multi-level, multi-genre learner corpus of academic writing.
Towards creating multilingual wordlists for the academic context
These academic wordlists serve as a resource for students to better understand words in the academic context
Corpus and system development project
Corpus and System Development for Automatic Captioning of Official Government Speeches
Digitisation of language resources
Extended digitisation of language resources. Building language resources for South African languages through digitisation of language and language-related text, audio, online, and video data.
Corpus and system development for wide coverage resource grammar
Development of wide coverage resource grammar for isiZulu, to make it available to the research community in a variety of ways.
African Wordnet development: Phase 2
Development of language resources for official South African Languages.
Multilingual digital corpus project
A multilingual digital corpus of Síphùthì as spoken in South Africa and Lesotho.
Spoken data corpus project
Spoken data corpus created for Afrikaans, Setswana, and Sesotho sa Leboa.
Through the lens Ex Machina: Using NLP and statistical learning methods to model eyewitness statements
The primary aim of this project is to develop and put to trial a new, innovative way of analysing and using eyewitness statements and descriptions to predict eyewitness identification performance. This has not been done before with natural language processing or machine learning methods and could solve the current difficulty of analysing large quantities of verbal data.
Health Resources in South African languages
A systematic review of health resources available for official South African languages, culminating in an index of health resources.
Communicative Development Inventories for South Africa’s Eleven Official Written Languages – Phase Two
Create digital communicative inventories for South African languages consisting of gestural, lexical, and syntactic data for children per language as well as spontaneous speech data.
Multilingual parallel subtitling and dubbing as part of discipline-specific literacy in faculties at two South African universities
Develop a translation protocol for academic literacy tests (this protocol will also consider the possibility of bias in translations).