Technical Papers (Centre for Research on Bangla Language Processing)

Permanent URI for this collection: https://hdl.handle.net/10361/638

  • Text to speech for Bangla language using festival
    (BRAC University, 2007) Alam, Firoj; Nath, Promila Kanti; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    In this paper, we present a Text to Speech (TTS) synthesis system for the Bangla language using the open-source Festival TTS engine. Festival is a complete TTS synthesis system, with components supporting front-end processing of the input text, language modeling, and speech synthesis using its signal processing module. The Bangla TTS system proposed here creates the voice data for Festival and additionally extends Festival using its embedded Scheme scripting interface to incorporate Bangla language support. Festival is a concatenative TTS system using diphone or other unit selection speech units. Our TTS implementation uses two different kinds of these concatenative methods supported in Festival: unit selection and multisyn unit selection. The modules of such a TTS system are described in this paper, followed by an evaluation of the quality of synthesized speech for acceptability and intelligibility.
  • Skew angle detection of bangla script using radon transform
    (BRAC University, 2006) Habib, S. M. Murtoza; Noor, Nawsher Ahamed; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    Skew angle detection and correction is an integral part of any OCR system. Without proper skew correction, the performance of an OCR will simply not be acceptable for most scanned images. We propose an innovative method for skew angle detection and correction for Bangla script using the Radon Transform. The basic idea is to identify the upper envelope by detecting the headline that accompanies most of the letters in the Bangla script, and then apply the Radon Transform to this upper envelope to get the skew angle. Once the angle is known, the correction is quite trivial to perform. While the current implementation handles only a single skew angle per text image, it can be extended to handle multiple skew angles by partitioning the document image.
  • Research report on Bengla tagset
    (BRAC University, 2007) Mahmud, Altaf; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This report describes the design of a POS tagset for Bangla, based on the Penn Treebank design. The resulting tagset contains 53 morpho-syntactic tags.
  • Research report on Bengla tagged lexicon
    (BRAC University, 2007) Hayder, Kamrul; Islam, Md Zahurul; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This report describes the design and implementation of a Bangla tagged lexicon. The resulting lexicon contains 144,770 entries, out of which 58,145 are verbs. The tags used in the lexicon are reproduced here from a previous report on the Bangla tagset.
  • Research report on Bangla optical character recognition using Kohonen network
    (BRAC University, 2007) Shatil, Adnan Md. Shoeb; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This report discusses the theory and implementation of an Optical Character Recognition (OCR) system for Bangla. The principal idea is to convert images of text documents, such as those obtained from scanning a document, into editable text. This report does not address pre-processing steps such as skew correction and noise reduction (which are handled in a previous report), so the documents are assumed to be pre-processed by another tool in the pipeline. For training and recognition, the input is first converted to a binary image, and then into a 25x25 pixel image; the only feature extracted from the images is a 625-bit long vector, which is then trained or classified using a Kohonen neural network. The OCR shows excellent performance for documents with a single typeface. The work in progress is extending it to handle multiple typefaces.
  • Research report on Bengla OCR training and testing methods
    (BRAC University, 2007) Hasnat, Md. Abul; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    In this paper we present the training and recognition mechanism of a Hidden Markov Model (HMM) based multi-font Optical Character Recognition (OCR) system for Bengali characters. In our approach, the central idea is to build a separate HMM for each segmented character or word. The system uses the HTK toolkit for data preparation, model training and recognition. The features of each trained character are calculated by applying the Discrete Cosine Transform (DCT) to each pixel value of the character image, where the image is divided into several frames according to its size. The extracted features of each frame are used as discrete probability distributions, which are given as input parameters to each HMM. In the case of recognition, a model for each separated character or word is built using the same approach. This model is given to the HTK toolkit to perform the recognition using the Viterbi decoding method. The experimental results show a significant performance improvement over models using neural network based training and recognition systems.
  • Research report on Bengla lexicon
    (BRAC University, 2007) Hayder, Kamrul; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    We report on the compilation of a comprehensive Bangla word list lexicon. The current list contains 80,969 words from the Standard Chalita Bhasha (SCB) vocabulary. The word list is currently being used by the BRAC University Bangla Spelling Checker application.
  • Research report on Bengla Verb and Noun Morphological analysis
    (BRAC University, 2007) Islam, Md. Zahurul; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This report describes the inflectional morphology of Bangla verbs and nouns, along with the rules, lexicons and grammar for Bangla morphological analysis.
  • Optical character recognition for Bangla documents using HMM
    (BRAC University, 2007) Monjel, Md. Sheemam; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    In this paper we describe an OCR program made for Bangla documents. This program uses HMM for the recognition process. The description of the full OCR program is too large to present here, so we emphasize the important and Bangla-specific processes of the OCR program. We define some features of Bangla characters and describe their extraction process.
  • Integrating Bangla computing support in openoffice.org
    (BRAC University, 2007) Sarkar, Asif Iqbal; Pavel, Dewan Shahriar Hossain; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This paper addresses the issues of integrating Bangla computing support into the OpenOffice.org office suite and, in the process, identifies and describes the problems in OpenOffice.org that should be solved in order to get optimum performance in Bangla script rendering and computing. The paper also discusses the integration and methodology of a number of small applications that have been developed to work with OpenOffice.org, and the problems related to this issue.
  • Automatic Bangla corpus creation
    (BRAC University, 2007) Sarkar, Asif Iqbal; Pavel, Dewan Shahriar Hossain; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This paper addresses the issue of automatic Bangla corpus creation, which will significantly help the processes of lexicon development, morphological analysis, automatic part-of-speech detection, automatic grammar extraction and machine translation. The plan is to collect all free Bangla documents on the World Wide Web, together with available offline documents, and extract all the words in them to make a huge repository of text. This body of text, or corpus, will be used for several purposes of Bangla language processing after it is converted to Unicode text. The conversion process is also an associated and equally important research and development issue. Among several procedures, our research focuses on a combination of font and language detection and Unicode conversion of retrieved Bangla text as a solution for automatic Bangla corpus creation; the methodology is described in the paper.
  • A comprehensive Bangla spelling checker
    (BRAC University, 2006) Naushad UzZaman; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    We present a comprehensive Bangla spelling checker that improves the quality of suggestions for misspelled words. The complex rules for Bangla spelling present a significant challenge in producing suggestions for a misspelled word when employing the traditional methods; one must take phonetic similarity into account for suggested alternatives to be reasonably accurate. In Bangla there are several algorithms available for spell checking; however, none of these considers the complex orthographic rules of Bangla. As a result, existing spelling checker applications do not perform well. In this paper, we describe the process of checking the spelling of a Bangla document (i.e. detecting misspelled words, generating suggestions for misspelled words, and ranking the suggestions), compare the methodologies with existing solutions available in the literature, and then propose solutions for each step. Finally, we conclude by showing the performance and evaluation of our proposed solution.
  • A survey on script segmentation for Bangla OCR
    (Center for research on Bangla language processing (CRBLP), BRAC University, 2007) Abduallah, Arif Billah Al-Mahmud; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    Script segmentation is an important primary task for any Optical Character Recognition (OCR) software. It is especially important in off-line OCR for printed characters. Through script segmentation, a large image of a written document is fragmented into a number of small pieces, which are then used for pattern matching to determine the expected sequence of characters. In the implementation of Bangla OCR, script segmentation may also play a vital role. For accurate and proper segmentation, however, it is necessary to identify the properties of Bangla script as well as the exceptions. This paper describes the most important and useful properties, advantages and disadvantages of various Bangla scripts, especially the printed scripts. It also gives some ideas regarding the prospective field of Bangla OCR and its applications.
  • Text normalization system for Bangla
    (BRAC University, 2008) Alam, Firoj; Habib, S. M. Murtoza; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This paper describes a text normalization system for the Bangla language (exonym: Bengali) built by identifying the semiotic classes in a Bangla text corpus. After the semiotic classes were identified, a set of rules was written for tokenization and verbalization. This study is important for Text-To-Speech (TTS) systems as well as for language models for speech recognition.
  • Research report on Bengali NLP engine for TTS
    (BRAC University, 2008-04-07) Alam, Firoj; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This report describes the Bengali NLP processor for TTS, along with the challenges faced in developing the NLP processor.
  • Research report on parallel corpus translation challenges and processes
    (BRAC University, 2007-10-08) Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    We describe some of the challenges in developing English-Bangla parallel corpora, and look at some of the established processes used by other language corpora for solutions to some of these challenges.
  • Research report on Translations of gTLDs and ccTLDs in Bangla
    (BRAC University, 2007-10-08) Alam, Firoj; Habib, Murtoza; Hayder, Kamrul; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This report describes the initial translations of gTLDs and ccTLDs in Bengali, along with the challenges faced in creating the translations.
  • Research report on Bangla wordnet development challenges and solutions
    (BRAC University, 2007-10-08) Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    We describe the initial design of Bangla WordNet (BWN), based on the English WordNet 2.2 distribution from Princeton University. Our goal is to develop a 5,000 entry Bangla WordNet over the next two years. At present, we are focusing on translating the English Ontology, and so far have created about 250 entries. We describe some of the challenges and potential solutions to these challenges in creating the basic WordNet structure, as well as in mapping the senses from English.
  • Acoustic analysis of Bangla vowel inventory
    (BRAC University, 2008) Alam, Firoj; Habib, S. M. Murtoza; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This paper describes the acoustic characteristics of Bangla vowels, obtained by analyzing the recordings of male and female voices. First, the duration of each phoneme was identified by averaging both the male and female voice data; then, formants were analyzed for all the phonemes; and finally, a vowel phoneme inventory was designed and is presented in this paper.
  • Acoustic analysis of Bangla consonants
    (BRAC University, 2008) Alam, Firoj; Habib, S. M. Murtoza; Khan, Mumit; Center for Research on Bangla Language Processing (CRBLP), BRAC University
    This paper describes the acoustic characteristics of Bangla consonants, obtained by analyzing the recordings of male and female voices. First, the duration of each phoneme was identified by averaging both the male and female voice data; then, formants were measured and formant comparisons were made for controversial phonemes, which also served to resolve the controversies in the existing phoneme inventories; and finally, a consonant phoneme inventory was designed.
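
The feature-extraction step summarized in the Kohonen-network OCR report listed above (binarize the input, rescale it to 25x25 pixels, and flatten it into a 625-bit vector) can be illustrated with a minimal sketch. The report's abstract does not say how the thresholding or rescaling is done, so the fixed threshold of 128, the nearest-neighbour sampling, the "ink = 1" convention, and the function name `to_feature_vector` are all assumptions made here for illustration only.

```python
def to_feature_vector(gray, size=25, threshold=128):
    """Turn a grayscale image into a flat list of size*size bits.

    gray      -- list of rows, each a list of 0-255 intensity values
    size      -- side length of the rescaled image (25 gives 625 bits)
    threshold -- assumed cut-off: pixels darker than this count as ink (1)
    """
    rows, cols = len(gray), len(gray[0])
    bits = []
    for r in range(size):
        # Nearest-neighbour sampling: map target row r back to a source row.
        src_r = r * rows // size
        for c in range(size):
            src_c = c * cols // size
            # Binarize on the fly: dark pixel -> 1, light pixel -> 0.
            bits.append(1 if gray[src_r][src_c] < threshold else 0)
    return bits


# Usage: a 50x50 image whose left half is black and right half is white.
image = [[0] * 25 + [255] * 25 for _ in range(50)]
vector = to_feature_vector(image)
print(len(vector))  # 625
```

A vector of this kind is what a classifier such as a Kohonen network would receive as input; the classifier itself is not sketched here.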