The Fifth workshop on Resources for African Indigenous Languages (RAIL)

Co-located with LREC-COLING 2024

Conference dates: 20-25 May 2024

Workshop date: 25 May 2024

Venue: Lingotto Conference Centre, Torino (Italy)

The fifth RAIL workshop website:

LREC-COLING 2024 website:

Submission website:

The fifth Resources for African Indigenous Languages (RAIL) workshop will be co-located with LREC-COLING 2024 at the Lingotto Conference Centre, Torino, Italy, on 25 May 2024. The RAIL workshop is an interdisciplinary platform for researchers working on resources (data collections, tools, etc.) specifically targeted towards African indigenous languages. In particular, it aims to create the conditions for the emergence of a scientific community of practice that focuses on data as well as computational linguistic tools specifically designed for or applied to indigenous languages found in Africa.

Many African languages are under-resourced, while only a few are somewhat better resourced. These languages often share interesting properties, such as writing systems or tone, that distinguish them from most high-resourced languages. From a computational perspective, these languages lack sufficient corpora for high-level development of Human Language Technologies (HLT) and Natural Language Processing (NLP) tools, which in turn impedes the development of African languages in these areas. During previous workshops, it became clear that the problems and solutions presented are not only applicable to African languages but are also relevant to many other low-resource languages. Because these languages share similar challenges, this workshop provides researchers with opportunities to work collaboratively on issues of language resource development and to learn from each other.

The RAIL workshop has several aims. First, the workshop brings together researchers who work on African indigenous languages, forming a community of practice for people working on indigenous languages. Second, the workshop aims to reveal currently unknown or unpublished existing resources (corpora, NLP tools, and applications), resulting in a better overview of the current state of the art, and to allow for discussions on novel, desired resources for future research in this area. Third, it enhances the sharing of knowledge on the development of low-resource languages. Finally, it enables discussions on how to improve the quality as well as the availability of the resources.

The workshop has “Creating resources for less-resourced languages” as its theme, but submissions on any topic related to properties of African indigenous languages (including non-African languages) may be accepted. Suggested topics include (but are not limited to) the following:

  • Digital representations of linguistic structures
  • Descriptions of corpora or other data sets of African indigenous languages
  • Building resources for (under-resourced) African indigenous languages
  • Developing and using African indigenous languages in the digital age
  • Effectiveness of digital technologies for the development of African indigenous languages
  • Revealing unknown or unpublished existing resources for African indigenous languages
  • Developing desired resources for African indigenous languages
  • Improving quality, availability and accessibility of African indigenous language resources


09:00-09:05 Opening
09:05-09:30 Doing Phonetics in the Rift Valley: Sound Systems of Maasai, Iraqw and Hadza; Alain Ghio, Didier Demolin, Michael Karani and Yohann Meynadier
09:30-09:55 Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal; Elodie Gauthier, Aminata Ndiaye and Abdoulaye Guissé
09:55-10:20 Long-Form Recordings to Study Children’s Language Input and Output in Under-Resourced Contexts; Joseph R. Coffey and Alejandrina Cristia
10:20-10:30 Developing Bilingual English-Setswana Datasets for Space Domain; Tebatso G. Moape, Sunday Olusegun Ojo and Oludayo O. Olugbara
10:30-11:00 Coffee break
11:00-11:25 Compiling a List of Frequently Used Setswana Words for Developing Readability Measures; Johannes Sibeko
11:25-11:50 A Qualitative Inquiry into the South African Language Identifier’s Performance on YouTube Comments; Nkazimlo N. Ngcungca, Johannes Sibeko and Sharon Rudman
11:50-12:15 The First Universal Dependency Treebank for Tswana: Tswana-Popapolelo; Tanja Gaustad, Ansu Berg, Rigardt Pretorius and Roald Eiselen
12:15-12:40 Adapting Nine Traditional Text Readability Measures into Sesotho; Johannes Sibeko and Menno van Zaanen
12:40-13:05 Bootstrapping Syntactic Resources from isiZulu to Siswati; Laurette Marais, Laurette Pretorius and Lionel Clive Posthumus
13:05-14:20 Lunch break
14:20-14:45 Early Child Language Resources and Corpora Developed in Nine African Languages by the SADiLaR Child Language Development Node; Michelle J. White, Frenette Southwood and Sefela Londiwe Yalala
14:45-15:10 Morphological Synthesizer for Ge’ez Language: Addressing Morphological Complexity and Resource Limitations; Gebrearegawi Gebremariam Gidey, Hailay Kidu Teklehaymanot and Gebregewergs Mezgebe Atsbha
15:10-15:35 EthioMT: Parallel Corpus for Low-resource Ethiopian Languages; Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh and Jugal Kalita
15:35-16:00 Resources for Annotating Hate Speech in Social Media Platforms Used in Ethiopia: A Novel Lexicon and Labelling Scheme; Nuhu Ibrahim, Felicity Mulford, Matt Lawrence and Riza Batista-Navarro
16:00-16:30 Coffee break
16:30-16:55 Low Resource Question Answering: An Amharic Benchmarking Dataset; Tilahun Abedissa Taffa, Ricardo Usbeck and Yaregal Assabie
16:55-17:05 The Annotators Agree to Not Agree on the Fine-grained Annotation of Hate-speech against Women in Algerian Dialect Comments; Imane Guellil, Yousra Houichi, Sara Chennoufi, Mohamed Boubred, Anfal Yousra Boucetta and Faical Azouaou
17:05-17:30 Advancing Language Diversity and Inclusion: Towards a Neural Network-based Spell Checker and Correction for Wolof; Thierno Ibrahima Cissé and Fatiha Sadat
17:30-17:55 Lateral Inversions, Word Form/Order, Unnamed Grammatical Entities and Ambiguities in the Constituency Parsing and Annotation of the Igala Syntax through the English Language; Mahmud Mohammed Momoh
17:55-18:00 Closing

Submission requirements:

We invite papers on original, unpublished work related to the topics of the workshop. Submissions presenting completed work may consist of up to eight (8) pages of content for a long submission and up to four (4) pages of content for a short submission, plus additional pages of references. Final camera-ready versions of accepted long papers are allowed one additional page of content (up to 9 pages) so that reviewers’ feedback can be incorporated. Papers should be formatted according to the LREC-COLING style sheet, which is provided on the LREC-COLING 2024 website. Reviewing is double-blind, so make sure to anonymise your submission (e.g., do not provide author names, affiliations, project names, etc.). Limit the number of self-citations (anonymised citations should not be used). The RAIL workshop follows the LREC-COLING submission requirements.

Please submit papers in PDF format to the START account. Accepted papers will be published in proceedings linked to the LREC-COLING conference.

Important dates:

Submission deadline: 28 February 2024 AoE (UPDATED)

Date of notification: 15 March 2024

Camera-ready deadline: 29 March 2024

RAIL workshop: 25 May 2024

Organising Committee

Rooweither Mabuya, South African Centre for Digital Language Resources (SADiLaR), South Africa

Muzi Matfunjwa, South African Centre for Digital Language Resources (SADiLaR), South Africa

Mmasibidi Setaka, South African Centre for Digital Language Resources (SADiLaR), South Africa

Menno van Zaanen, South African Centre for Digital Language Resources (SADiLaR), South Africa