Activities
2024 Conference presentations
- AFRILEX – Role of SADiLaR: advice on formats, backups, licensing, software, platforms; Community of Practice – Reflections (Mr Juan Steyn and Dr Friedel Wolff) – PDF Presentation
- AFRILEX – Making sense of kuningi using a corpus linguistic analysis (Prof Langa Khumalo) – PDF Presentation
- AFRILEX – Corpus-based dictionaries for low-resource languages (Ms Mmasibidi Setaka & Prof Menno van Zaanen) – PDF Presentation
- Global AI Conference – Multilingualism: A case of the South African Centre for Digital Language Resources in developing language resources (Ms Rooweither Mabuya) – PDF Presentation
- Global Virtual Forum Summer School – African Digital Humanities and the Ethics of AI (Ms Andiswa Bukula) – PDF Presentation
- Southern African Folklore Society – Digitisation as a Catalyst for Preserving Xhosa Oral Literature and Histories in the Age of Artificial Intelligence (Ms Andiswa Bukula) – PDF Presentation
2024 Conference workshops and tutorials and other events
- UCT Language Indaba – Language Resources as Enablers (Prof Langa Khumalo) – PDF Presentation
- SADC Open Science Workshop – “Democratising Knowledge through Open Science” (Prof Langa Khumalo) – PDF Presentation
- Pre-conference Workshop – Towards a sustainable National Term Bank for the official languages of South Africa: Collaboration vs Fragmentation (Prof Justus Roux, Prof Rufus Gouws,….) – PDF Presentation
- DH-IGNITE @ ALASA 2024 – (Ms Jessica Mabaso, Dr Muzi Matfunjwa, Dr Respect Mlambo, Ms Rooweither Mabuya) – PDF Presentation
- Pre-conference Workshop (SAALT) – Assessment literacy and the matter of enhancing translation practices of assessment tools (Prof Tobie Van Dyk) – PDF Presentation
- SALALS Pre-conference Workshop – Introduction to Text Analysis Tools (Dr Muzi Matfunjwa and Dr Respect Mlambo)
DH Colloquia
SADiLaR organizes monthly Digital Humanities colloquia. These typically take place on Wednesdays (in the middle of the month) from 10:00 to 11:00 SAST. During these DH colloquia a wide variety of topics are discussed, mostly on content related to Digital Humanities, sometimes focusing more on the techniques or methodologies used, sometimes focusing more on the applications or application areas.
The DH colloquia are part of Escalator’s Explorer track. You can find more information on Escalator here: https://escalator.sadilar.org/, on Escalator’s championship programme here: https://escalator.sadilar.org/champions/overview/, and on the Explorer track within Escalator’s championship programme here: https://escalator.sadilar.org/champions/explorer/. Also check out the other tracks within the Escalator championship programme as there may be tracks directly related to your interests. If you want to be a member of the Digital Humanities community, you may also want to consider joining the DHCSSza Slack. This page will provide more information on how to join (this is also free): https://escalator.sadilar.org/connect/.
If you have suggestions for speakers at the DH colloquium (or if you want to speak yourself), or if you want to provide feedback, please do not hesitate to contact Prof Menno van Zaanen: menno.vanzaanen@nwu.ac.za.
- Andreas Baumann – Frequent words are semantically more stable than rare ones: what computational modeling, corpus analysis, and psycholinguistic databases can tell us about lexico-semantic change (2 September 2024)
- Tim Brookes – Writing Beyond Writing (14 August 2024)
- Rory du Plessis – “Are they human or are they data?” Digital archives and the creation of humanising stories (17 July 2024)
- Maciej Ogrodniczuk – Universal Discourse: Towards a multilingual model of discourse relations (12 June 2024)
- Johannes Sibeko – Is it written to be read? A case of readability in Sesotho (15 May 2024)
- Iris Auda and Pule kaJanolintji – isiBheqe: First additional script Language Pedagogy in African Digital Orthographies — The case of isiBheqe soHlamvu digital tools for use in language and linguistics learning (10 April 2024)
- Robyn Berghoff and Emanuel Bylund – What do we study when we study multilingualism? A bibliometric(-adjacent) analysis of the field (13 March 2024)
- Hanél Duvenage – Data in healthcare: efforts digitisation and digitalisation (21 February 2024)
- Phillip Ströbel – Innovating Historical Scholarship: The Bullinger Digital Project (31 January 2024)
SWiP Events
- SWiP Writing Competition – The competition provides an opportunity for both new and existing editors to engage in content creation while fostering a sense of community and collaboration.
- Two-day authorship workshops – Conducted across 6 regions targeting 10 universities and training 20 participants per university.
- SWiP side event & exhibition at SFSA2024 – Preserving Languages & Scientific Information: Accessible Knowledge for All
- SWiP Project Launch – The event introduced a collaboration aiming to preserve African languages and open up access to scientific information in South Africa.
Projects
Autshumato 6
Maintenance on the following Autshumato software systems:
The Integrated Translation Environment (ITE), the Terminology Management System (TMS) and the MT Web service (MTWS). Maintenance on the existing Autshumato MT systems (English into Afrikaans, isiZulu, Sepedi, Setswana, Sesotho, and Xitsonga) including processing of additional data acquired since the previous projects, and retraining and optimising the systems with the newest methods; and research on, and development of MT systems for automatic translation from Afrikaans, isiNdebele, isiZulu, Sepedi, Setswana, Sesotho, Tshivenḓa and Xitsonga into English.
Communicative Development Inventories for South African Languages: Phase Two
Southern African Communicative Development Inventories for six languages
The addendum seeks to gather data on the language development of 1200 children aged 8-30 months, using communicative development inventories for six South African languages (isiZulu, isiNdebele, Sesotho sa Leboa, South African English, Siswati and Tshivenda). The data will be digitally stored at SADiLaR and made freely available to scientists and others involved in language.
Online and open-access Southern African Encyclopedia of Music and Sound.
The aim of the study is to lay the groundwork for the long-term facilitation and compilation (effectively an open-ended process) of an online and open-access Southern African Encyclopedia of Music and Sound.
Enabling Localised Language Technology Applications: A Computational Wide-Coverage Resource Grammar for isiZulu
The aim of this project is to develop a Computational Wide Coverage Resource Grammar (WCRG) for isiZulu and to make it available to the research community in a variety of ways.
African Wordnet and Multilingual Literary Terminology Development
The African Wordnet (AfWN) and Multilingual Literary Terminology Development project concerns the development of language resources in the form of wordnets and a literary term bank for various African languages.
Corpora of spoken language
Corpora of spoken language for Sesotho, Setswana and Sepedi
Parallel corpora for English into isiXhosa
Development of parallel data sets between English and isiXhosa.
Collaboration with The Carpentries
A collaboration with The Carpentries to teach foundational coding and data science skills to researchers in South Africa.
Audit on Language Resources at Higher Education Institutions (HEIs)
The aim of the project was to carry out an analysis of all the language resources and policies currently in use at public higher education institutions in South Africa. This was done by conducting an audit at the HEIs and reporting on the findings.
Extended digitisation of language resources
Building language resources for the indigenous South African languages through digitisation of language and language-related text, audio, online and video data. Digitisation will also include digital resources for specific needs and projects.
Multilingualism in the Online Writing Centre
This project aims to identify strategies that can be used to enhance multilingualism and/or interaction during writing centre consultations (FTF and Online) that are conducted in a student’s L2 (English).
Word Embeddings Analytics for South African Media analysis: Toward the construction of a Polarization Barometer
The aim of the project is to develop algorithmic and mathematical tools that extract representations of individuals’ opinions from embeddings trained on SA Media data, and to use these tools to develop a news-based indicator of polarization in South African society. Our contributions will help researchers understand, predict and possibly influence the mechanisms behind the dynamics of polarization in South Africa.
Word Embeddings for South African Languages
The aim of the project is to create word embeddings for three South African languages using the limited data that is available for these languages. The word embeddings will leverage existing resources to create new resources that can be used in the development of various core technologies to the benefit of the South African languages.
Corpora of spoken language for Sesotho, Setswana and Sepedi
The primary aim of the project is to create a publicly available (Open Access) set of multi-language, multi-variety corpora for Sesotho, Setswana and Sepedi.
Enabling Localised Language Technology Applications Phase II: Developing Nguni Computational Grammars and Resources
The purpose of this project is to use the existing isiZulu Grammatical Framework (GF) resource grammar to not only extend computational resources for isiZulu, but to bootstrap two new resource grammars within the Nguni language family, namely isiXhosa and Siswati, thus enabling the development of similar computational resources for these two new languages.
Multilingual parallel subtitling and dubbing as part of discipline-specific literacy in faculties at two South African universities
Subtitle and dub a selection of English discipline-specific (animal anatomy and physiology) videos in four languages (Sepedi, isiZulu, isiXhosa and Afrikaans).
Linguistic corpus enrichment for South African languages
In this project, we aim to convert and extend the enriched corpora for the four official South African languages with a conjunctive orthography, i.e. isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS), as well as for the five disjunctively written languages, i.e. Sesotho sa Leboa (NSO), Sesotho (ST), Setswana (TN), Tshivenda (VE), and Xitsonga (TS). The project consists of two phases, viz. the conversion of existing NCHLT annotated data and the extension of part-of-speech annotated corpora with data from different text types.
Data Harvesting Phase 3: Harvesting existing sources of speech data for HLT development in South Africa
The proposed Phase 3 objective updates the Phase 2 objectives, building on the transcription techniques refined during the Phase 2 extension. It includes limited collection of spoken audio (existing datasets and new sources, adding to the inventory as this can be accommodated) for these four languages; developing corpora of transcribed speech data from the existing datasets; processing the data with applicable tools; and data curation, packaging and release of the data to SADiLaR, which will extend the project to include 10 official languages of South Africa for speech technology development.
USAf National Language Resources Audit 2023
The aim of this project is to carry out an analysis of all the language resources and policies currently in use at public Higher Education Institutions in South Africa. This will be done by conducting an audit at the HEIs and reporting on the findings.
ǂKhomani San | Hugh Brody Collection
Language verification, orthography standardisation, and development of metadata for all photographs in the collection. This includes the standardisation throughout the collection of traditional names transcribed in N|uu, Kora, or Khoekhoegowab (i.e. in instances where a photograph has a description of places or names of people, these will be transcribed in the original source language using modern orthography practices) using standardized Unicode throughout the collection to make the content uniform and digitally findable.
Corpus and System Development for Automatic Captioning of Government Speeches.
The primary aim of the proposed project is to create a corpus of automatically transcribed government speeches. The CSIR proposes to start with the current president (Mr Cyril Ramaphosa) and then expand the corpus with speeches made by previous presidents and/or other members of parliament. A secondary aim is to initiate the development of an automatic speech recognition system that could serve as a first step towards addressing the need for automatic captioning expressed by GCIS.
Digitisation of Reel-to-Reel Cassette Tapes from the Hidden Years Archive
This project aims to digitise a selection of audio-visual material from the Hidden Years Music Archive hosted at the Africa Open Institute at Stellenbosch University and to make it available as Open Access content on the SADiLaR repository.
Digitisation of language resources
Building language resources for the Indigenous South African languages through digitisation of language and language-related data
Digital dictionary collaboration project with African Tongue
Digital Dictionary Resources for N|uu: The last living South African San Language.
Spelling checkers for 10 South African languages
This project aims to develop and make spelling checkers available for 10 South African languages.
Linguistic corpus enrichment for conjunctively written South African languages.
This project aims to develop enriched corpora for the four official South African languages with a conjunctive orthography, i.e. isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS). The corpora will consist of approximately 50,000 tokens, parallel on sentence level, with English as source language, for each language.
Pictorial language representation for Sepedi: Language access for persons with severe communication disabilities
The aim of the project is to develop a research-informed picture-based vocabulary package for Sepedi/Sesotho sa Leboa. This vocabulary package will be designed so that it can be incorporated into an electronic and paper-based AAC system that allows for a measure of novel utterance generation.
Communicative Development Inventories for all South Africa’s eleven official languages – Phase Two
Communicative development inventories for Sesotho sa Leboa, Tshivenḓa, isiNdebele, Siswati, isiZulu and South African English
Data Harvesting Extension: Phase 2, Harvesting existing sources of speech data for HLT development in South Africa
The objective of Phase 2 of the project was to identify potential sources of spoken audio (existing datasets and new) for two additional languages per year; design and implement data capturing tools and procedures, or consider existing tools, for these four languages; and curate, package and release the data to SADiLaR for speech technology development in South Africa.
Phase 2: Harvesting existing sources of speech data for HLT development in South Africa
The aim of the project is to develop speech resources in a (semi-)automatic manner (for L3, L4, L5 and L6) based on the Phase 1 feasibility study. This will entail the collection of appropriate speech and text data for L3, L4, L5 and L6, enabling the development of baseline ASR systems, followed by the development and release of automatically transcribed speech data and updated harvesting procedures for the remaining languages (L7 to L11). All data collected through the harvesting platform will be released under the least restrictive license possible, given the origin of the data, with CC BY 4.0 International as the default and CC BY-NC 4.0 International as the most restrictive license.
Multilingual parallel subtitling and dubbing as part of discipline-specific literacy in faculties at two South African universities
Subtitle and dub a selection of English discipline-specific (animal anatomy and physiology) videos in four languages (Sepedi, isiZulu, isiXhosa and Afrikaans).
Compiling a Child Speech Database
Compiling a child speech database for the South African context: speech samples of typically developing Afrikaans- and Sesotho sa Leboa-speaking children.
Homicide Media Tracker and Database
The aim of this project is to create a functional, methodologically sound, user-friendly database, tool, and platform that will enable deep, rich, and broad research into media-based content, specifically news reports of fatal violence. It also includes the population of an actual homicide news database, which will facilitate the open sharing of this data with students and researchers.
Writing Centre Project
Multilingualism in the Online Writing Centre
VOC Daghregister transcription project: Phase 2
Digitisation and transcription of VOC journals of the Cape of Good Hope, vested in the Western Cape Archives and Records Service in Cape Town.
Dictionary portal for South African languages
Generation of clear protocols relating to the digitisation project that will enable the building of language resources for the official South African languages through the digitisation of relevant analogue text, graphic, audio and video data.
Parallel corpora for English-Siswati
Collect and process bilingual data for the development of an English-Siswati parallel corpus with a size of 1,650,000 words.
Digitisation of language resources
Generation of clear protocols relating to the digitisation project that will enable the building of language resources for the official South African languages through the digitisation of relevant analogue text, graphic, audio and video data.
Expansion and further refinement of a multi-level, multi-genre learner corpus of academic writing
Collect data from various universities in South Africa to grow the multi-level, multi-genre learner corpus of academic writing.
Towards creating multilingual wordlists for the academic context
These academic wordlists serve as a resource to help students better understand words in the academic context.
Corpus and system development project
Corpus and System Development for Automatic Captioning of Official Government Speeches
Digitisation of language resources
Extended digitisation of language resources. Building language resources for South African languages through digitisation of language and language-related text, audio, online, and video data.
Corpus and system development for wide coverage resource grammar
Development of wide coverage resource grammar for isiZulu, to make it available to the research community in a variety of ways.
African Wordnet development: Phase 2
Development of language resources for official South African Languages.
Multilingual digital corpus project
A multilingual digital corpus of Síphùthì as spoken in South Africa and Lesotho.
Spoken data corpus project
Spoken data corpus created for Afrikaans, Setswana, and Sesotho sa Leboa.
Through the lens Ex Machina: Using NLP and statistical learning methods to model eyewitnesses statements
The primary aim of this project is to develop and put to trial a new, innovative way of analysing and using eyewitness statements and descriptions to predict eyewitness identification performance. This has not been done before with natural language processing or machine learning methods and could solve the current difficulty of analysing large quantities of verbal data.
Health Resources in South African languages
A systematic review of health resources available for official South African languages, culminating in an index of health resources.
Communicative Development Inventories for South Africa’s Eleven Official Written Languages – Phase Two
Create digital communicative inventories for South African languages consisting of gestural, lexical, and syntactic data for children per language as well as spontaneous speech data.
Multilingual parallel subtitling and dubbing as part of discipline-specific literacy in faculties at two South African universities
Develop a translation protocol for academic literacy tests (this protocol will also consider the possibility of bias in translations).