Parallel corpora for English-isiXhosa and English-Siswati

Project Type: Node
Project Start Date: 1 July 2019
Project Status: Completed and delivered

English-Siswati corpus

Project Aims:

This project entailed the collection and processing of bilingual data to develop a 2-million-word English–Siswati parallel-aligned corpus that can be used to train machine translation systems. The data was acquired by crawling various South African web domains and through human translation, with each source accounting for roughly 50% of the final corpus.

A 1,5-million-word monolingual Siswati corpus was also created and packaged with the parallel corpus as an additional deliverable.

Project Deliverables:

2-million-word English-Siswati parallel corpus
1,5-million-word Siswati monolingual corpus

English-isiXhosa corpus

Project Aims:

In this project, a 1,85-million-word parallel corpus for English-isiXhosa was developed. The bulk of the data (80%) was collected from various South African (mainly government) web domains. The remainder consists of previously unreleased data sourced for the DSAC-funded Autshumato project. The corpus is aligned at sentence level and can be used for machine translation system development.

In addition, a 2,5-million-word monolingual isiXhosa corpus has been made available.

Project Deliverables:

1,85-million-word English-isiXhosa parallel corpus
2,5-million-word isiXhosa monolingual corpus

Contact details:

Please contact ctext@nwu.ac.za
