Harvesting existing sources of speech data for HLT development in South Africa


Project Type: SADiLaR Node - CSIR Meraka Institute
Project Start Date: 1 April 2018
Project Status: Finalising

Project Aims:

The project explores different possibilities for the (semi-)automatic harvesting of existing sources of speech data to create resources that can be used to develop new speech technologies and improve existing ones. Ultimately, the aim is to enlarge the existing speech corpora for all of South Africa’s official languages. This will entail collecting appropriate speech and text data for L1 to L6, enabling the development of baseline ASR systems, followed by the development and release of automatically transcribed speech data and updated harvesting procedures for the remaining languages (L7 to L11).

Project Deliverables:

  • Overview report of available spoken audio broadcast sources, identified harvesting strategies and study design
  • NCHLT I word and phone recognition results
  • Auxiliary corpus of 300+ hours for the same speakers as in the NCHLT I corpora (a minimum of 30 hours of data for at least 8 languages)
  • Second auxiliary corpus of 200+ hours of additional data (distributed across different languages)
  • Kaldi acoustic models for all 11 languages
  • Baseline ASR systems for at least two languages (L1 and L2) derived from all available data (text and speech)
  • A minimum of 50 hours of automatically transcribed data using baseline systems with extended pronunciation dictionaries to cover pronunciations
  • Data collection (L3 and L4)
  • Baseline ASR systems for L3 and L4 derived from all available L3 and L4 data (text and speech)
  • A minimum of 50 hours of automatically transcribed data using baseline systems (L1 to L4) with extended pronunciation dictionaries to cover pronunciations
  • Harvesting procedure
  • Data collection (L5 and L6)
  • Baseline ASR systems for L5 and L6 derived from all available L5 and L6 data (text and speech)
  • A minimum of 50 hours of automatically transcribed data using baseline systems with extended pronunciation dictionaries to cover pronunciations
  • Research outputs in the form of journal articles and conference proceedings