Enhancing Bangla speech recognition through acoustic and language modeling

Citation

Abstract

This work compares the performance of two deep ASR models, Wav2Vec2 and Whisper, in recognizing Bengali, a language with complex linguistics, dialectal variation, and phonetic intricacies. ASR technology has become crucial for interacting with devices in many applications because it converts spoken language into text or commands. Its core components are feature extraction, acoustic modeling, language modeling, and decoding, which together translate speech into actionable formats. Despite advances, ASR systems still struggle with pronunciation diversity, code-switching, and ambient noise, especially in languages like Bengali that exhibit significant phonetic and dialectal variation. Wav2Vec2 and Whisper were selected because of their earlier success in other languages, but how well they adapt to Bengali ASR remained to be studied. The work focuses on the models’ sensitivity to parameter tuning and their ability to generalize across linguistic features. Wav2Vec2 proved flexible: tuning parameters such as learning rate, dropout, and gradient accumulation produced noticeable improvements in word error rate (WER), which shows its adaptability to Bengali’s phonetic nuances. Each tuning configuration progressively improved model stability and evaluation accuracy, positioning Wav2Vec2 as a potential candidate for real-world Bengali applications that require precision and flexibility. Whisper, in contrast, was rigid under tuning: it held the same WER of 100% across all settings and was thus insensitive to the tuning. This may reflect a structural limitation of the Whisper model that affects its performance in high-precision applications involving linguistically complex languages like Bengali.
Preprocessing and feature extraction standardized the audio data for both models, tokenization provided linguistic alignment, and parallel inference was used to compare performance. While Wav2Vec2 showed promise with incremental improvements in transcription accuracy, Whisper’s inability to adapt underscores the need for architectural revisions to meet the demands of Bengali ASR tasks. These results also suggest that Wav2Vec2, through the flexibility of its parameter tuning, is better able to handle the linguistic diversity of Bengali, whereas Whisper is suited mainly to standardized languages, where rigidity poses no problem. The thesis concludes that Wav2Vec2 is more appropriate for sophisticated Bengali ASR applications, especially in dialectally diverse or noisy contexts, while Whisper may require substantial changes to become compatible with a linguistically complex setting. These findings are further shaped by dataset diversity, limited computational resources, and model tunability in the present study, which underlines the importance of specialized ASR frameworks for the linguistic diversity inherent in Bengali and similar languages.
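
For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of words in the reference. The sketch below illustrates that standard definition only; it is not the evaluation code used in the thesis, and the function name `word_error_rate` and the sample strings are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, used only to exercise the function.
perfect = word_error_rate("w1 w2 w3 w4", "w1 w2 w3 w4")
one_error = word_error_rate("w1 w2 w3 w4", "w1 xx w3 w4")
all_wrong = word_error_rate("w1 w2", "xx yy")
```

Under this definition, a WER of 100% means the number of word errors equals the number of reference words, which happens, for example, when every reference word is substituted or when the output is empty. Because insertions also count as errors, WER can exceed 100%.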

Description

Cataloged from PDF version of thesis.
Includes bibliographical references (pages 54-57).
This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science, 2024.

Publisher Link

Type

Thesis