Microsoft India today announced the availability of Speech Corpus for Indian languages. Now the largest publicly available Indian-language speech dataset, Speech Corpus offers speech training and test data for Telugu, Tamil, and Gujarati.
The release aims to help researchers improve speech recognition technology for speech-driven applications. Speech Corpus is made available through the Microsoft Research Open Data initiative, a collection of free datasets that can be used to extend research in areas like natural language processing, computer vision, and domain-specific sciences.
“We believe India’s increasing digital literacy needs to be supported by a multi-lingual digital world. Microsoft Indian Language Speech Corpus is an extension of our ongoing efforts to reduce language barriers and empower Indians to harness the full potential of the Internet. Using our technology expertise, we want to accelerate innovation in voice-based computing for India by supporting researchers and academia,” commented Sundar Srinivasan, General Manager of Artificial Intelligence & Research at Microsoft India.
According to a press release shared by Microsoft, Speech Corpus for Indian languages was tested at Interspeech 2018, the world’s largest conference on spoken language processing and the science and technology that drives it. Participants in the Low Resource Speech Recognition challenge used data from the corpus to build Automatic Speech Recognition (ASR) systems, and they reportedly succeeded in creating high-quality speech recognition models from the available data. Microsoft also notes in its press release that many vernacular languages across the globe lack enough digital text, speech, and linguistic resources to build large machine learning models.
The challenge is understandable given how subtle the differences in enunciation, accent, diction, and slang are across the various regions of India. Microsoft believes that the release of Speech Corpus for Indian languages will help overcome these differences and build systems that can connect more easily with users in the future.