Jaco Badenhorst, Peleira Zulu and Neil Kleynhans
The Meraka Institute's automatic speech recognition (ASR) team is making headway in its efforts to develop acoustic models for speech recognition in all South African official languages. The speech recognition system forms an integral part of the Lwazi system, a Department of Arts and Culture-funded project that aims to deliver information about government services to all people within South Africa's borders.
The team is made up of PhD student Neil Kleynhans, who focuses on system development and on making the system user-friendly; Master's student Jaco Badenhorst, who is responsible for building the systems; and PhD student Peleira Zulu, who brings in the data for the systems, in collaboration with North-West University.
The ASR systems are developed using two software tools, namely the Hidden Markov Model Toolkit (HTK) and ASR-Builder. HTK is used to develop acoustic models of speech and to run the ASR systems live. ASR-Builder provides a user-friendly front-end to the HTK tools and offers the system designer additional functionality that can reduce the lengthy time needed to develop ASR systems.
Kleynhans says that there is a need for software that eases the transition from raw speech data to a final ASR system. "ASR-Builder is our contribution in fulfilling this need." Feedback from users within the broader Human Language Technologies group has proved useful in helping the team develop the software further and add more features to make it as user-friendly as possible, he says.
He credits Mark Zsilavecz for laying a solid foundation for the work they are currently doing. "Mark started working on what was then known as HEasy towards the end of 2006, using scripts written by Marius Peche, who helped make the use of HTK easier, but by April 2007 he was the sole developer. What they called HEasy at the time has now been renamed to ASR-Builder."
The core of Badenhorst's responsibility is modelling the acoustics - the sounds that one wishes to capture. He then creates a model for each sound in a language. Several factors come into play when modelling sounds, he explains. "The data that you use to create the models are important and need to be clean and error-free; then you have a large number of speakers and each one of them is different. Finally, what they say is also very important. We would typically need at least four hours of data with 200 different speakers. Research is pointing towards a need to use more than ten hours of data to develop the models properly."
North-West University works with the Meraka team to make recordings in all 11 languages. "It is an enormous task", says Badenhorst, "but we have made good progress in getting fewer errors in the data. As of September, we now have the data for all the languages and have developed tests on the data. We are able to detect noise, errors and other impurities that occur when, for example, people say place names or say a word in a different language."
"When we train a model to represent a sound, three factors are important: who made the sound, how much of the sound he or she has said and, of course, what that sound is," he adds.
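The factors Badenhorst lists (who made the sound, how much of it was said, and what the sound is) can be tallied over a corpus before any model is trained. The Python sketch below is only an illustration and is not part of HTK or ASR-Builder: the record layout `(speaker_id, duration_seconds, phone_labels)` and the function names are assumptions made for this example, and the only figures taken from the article are the four hours and 200 speakers that Badenhorst mentions.

```python
from collections import defaultdict

# Minimums quoted in the article; everything else here is hypothetical.
MIN_HOURS = 4.0
MIN_SPEAKERS = 200


def summarise_corpus(utterances):
    """Tally a corpus given (speaker_id, duration_seconds, phone_labels) records.

    The record layout is an assumption for this sketch, not a real
    HTK or ASR-Builder file format.
    """
    total_seconds = 0.0
    speakers = set()
    speakers_per_phone = defaultdict(set)  # who made each sound
    tokens_per_phone = defaultdict(int)    # how often each sound was said

    for speaker_id, duration_seconds, phones in utterances:
        total_seconds += duration_seconds
        speakers.add(speaker_id)
        for phone in phones:
            speakers_per_phone[phone].add(speaker_id)
            tokens_per_phone[phone] += 1

    return {
        "hours": total_seconds / 3600.0,
        "speakers": len(speakers),
        "tokens_per_phone": dict(tokens_per_phone),
        "speakers_per_phone": {p: len(s) for p, s in speakers_per_phone.items()},
    }


def meets_minimum(summary, min_hours=MIN_HOURS, min_speakers=MIN_SPEAKERS):
    """Check a summary against the hours and speaker counts quoted above."""
    return summary["hours"] >= min_hours and summary["speakers"] >= min_speakers
```

A summary like this makes it easy to spot sounds that were recorded from only a handful of speakers, which is the kind of gap that more data would be expected to close.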
Zulu believes that when they complete the work that they are doing for the Lwazi project, various users in both urban and rural areas will benefit as they will be able to access information in their own languages. "Building the systems preserves both the languages and the cultures of the people who speak them. There is no need for indigenous languages to become obsolete as a result of technology. On the contrary, we should be able to use technology to empower people with information in any language."
Apart from their current work for the Lwazi project, the three young scientists are also pursuing postgraduate studies in human language technologies. Kleynhans is doing his PhD on channel normalisation for speech systems. Zulu is doing his PhD on language distance, focusing specifically on South African languages, while Badenhorst's Master's research focuses on improving speech recognition by sharing resources across languages.
Enquiries: CSIR Communication