Project Type: SADiLaR Node – CSIR Speech Node
Project Start Date: 1 April 2020
Project Status: In progress
Project Aims:
The primary aim of the proposed project is to create a corpus of automatically transcribed government speeches. The CSIR proposes to start with the current president (Mr Cyril Ramaphosa) and then expand the corpus with speeches made by previous presidents and/or other members of parliament. A secondary aim is to initiate the development of an automatic speech recognition system that could serve as a first step towards addressing the need for automatic captioning expressed by GCIS.
Project Deliverables:
Resources transferred from the GCIS archive will be utilised to produce the following:
- Evaluation data set (5 hours in total)
- Report on ASR performance evaluation
- Corpus and related documentation transferred to SADiLaR. Depending on the availability of speeches from the GCIS archive, the project will provide approximately 10 hours of speech per year spanning a 7-year period, which should yield approximately 100 hours of speech in total. This corpus will be released under a non-commercial, non-exclusive research license, as GCIS is the proprietary owner thereof.
- Research outputs describing the released corpus, the acoustic analysis and findings
- Baseline captioning system