Project Type: Node
Project Start Date: 1 July 2019
Project Status: Completed and delivered
English-Siswati corpus
Project Aims:
This project entailed the collection and processing of bilingual data to develop a 2-million-word English-Siswati parallel-aligned corpus that can be used to train machine translation systems. The data was acquired by crawling various South African web domains and through human translation, with each source accounting for roughly 50% of the final corpus.
A 1,5-million-word monolingual Siswati corpus was also created and packaged with the parallel corpus as an additional deliverable.
Project Deliverables:
- 2-million-word English-Siswati parallel corpus
- 1,5-million-word Siswati monolingual corpus
English-isiXhosa corpus
Project Aims:
In this project, a 1,85-million-word English-isiXhosa parallel corpus was developed. The bulk of the data (80%) was collected from various South African web domains, mainly government sites. The remainder consists of previously unreleased data sourced for the DSAC-funded Autshumato project. The corpus is aligned at sentence level and can be used for machine translation system development.
In addition, a 2,5-million-word monolingual isiXhosa corpus has been made available.
Project Deliverables:
- 1,85-million-word English-isiXhosa parallel corpus
- 2,5-million-word isiXhosa monolingual corpus
Contact details:
For enquiries, please contact ctext@nwu.ac.za