Intelligent Script Identification Systems in the Context of Indian Languages

Acharya, Dinesh U (2008) Intelligent Script Identification Systems in the Context of Indian Languages. PhD thesis, Manipal Institute of Technology, Manipal.


Abstract

It is desirable to have machines capable of handling inputs in a variety of forms such as printed/handwritten paper documents, speech, etc. In a multi-lingual country like India, which has many languages with their own distinctive scripts and rich literary traditions, it is particularly important to develop computer systems that allow users to interact with them in Indian languages. In the context of document image analysis, what we need are techniques for analyzing and understanding printed/handwritten documents in Indian languages. Due to the peculiarities of Indian scripts (and languages), solutions that work well for languages such as English would not be applicable, in their totality, to Indian languages. In fact, relatively little research has been reported on Indian language character recognition. Most of the existing work concerns Devanagari and Bangla script characters, the two most popular languages in India. Some studies are reported on the recognition of other languages such as Tamil, Telugu, Oriya, Kannada, Punjabi, Gujarati, etc. Tree classifiers based on structural and topological features, and neural network classifiers, are mainly used for the recognition of Indian scripts. Further, in the Indian context, many documents contain text of more than one script (for example, English, Hindi and the local language), and hence recognition and segmentation of different scripts from a multi-lingual document is also an important problem. At present, several organizations have started working on optical character recognition (OCR) for Indian languages. The Ministry of Information Technology, Government of India, has initiated a Technology Development for Indian Languages (TDIL) project under which OCR system development for most of the important Indian language scripts has been taken up by different labs and academic institutions. The thesis includes the development of OCRs for two south Indian scripts, Malayalam and Kannada.
The different components of an OCR include preprocessing (segmentation, feature extraction) and classification. Image acquisition, usually carried out with a digital scanner, is followed by preprocessing, which includes binarization and noise removal. The digitized image is binarized using a histogram-based thresholding approach, with the threshold value chosen as the midpoint between the two histogram peaks. Median filtering is used to remove noise from the binarized image. It is evident from the literature that two components of the character recognition process are particularly important and, at the same time, complicated: first, the segmentation or separation of characters, and second, feature extraction. In document analysis, the word "segmentation" may refer to line, word or character segmentation. Character segmentation is fundamental to character recognition approaches that rely on isolated characters. It is a critical step because incorrectly segmented characters are unlikely to be correctly recognized. The segmentation in the Malayalam and Kannada OCRs uses the projection profile technique and a zoning algorithm along with connected component analysis to segment the characters. The binarized image is processed into lines and words using appropriate horizontal and vertical projection profiles. For Malayalam and Kannada characters the projection profile approach alone will not give the desired output, as individual characters are composed of combinations of left (in Malayalam), right, top and bottom modifiers with the base consonants. Overlapping of modifiers with the base consonant, in forming a valid character, causes additional difficulty in segmentation.
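The binarization and noise-removal steps described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the 32-level minimum separation between histogram peaks is an assumed detail, and a simple 3x3 median filter stands in for the noise-removal stage.

```python
import numpy as np

def binarize_midpoint(gray):
    """Threshold a grayscale image at the midpoint between its two main
    histogram peaks (simplified sketch of histogram-based thresholding)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    peak1 = int(np.argmax(hist))
    # Find the second peak at least 32 gray levels away from the first
    # (hypothetical separation; the thesis does not state this detail).
    masked = hist.copy()
    masked[max(0, peak1 - 32):min(256, peak1 + 32)] = 0
    peak2 = int(np.argmax(masked))
    threshold = (peak1 + peak2) // 2
    return (gray > threshold).astype(np.uint8)  # 1 = background, 0 = ink

def median_filter3(img):
    """3x3 median filter to remove salt-and-pepper noise (pure NumPy)."""
    padded = np.pad(img, 1, mode='edge')
    windows = [padded[r:r + img.shape[0], c:c + img.shape[1]]
               for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0).astype(img.dtype)
```

A dark-ink character on a light page produces two histogram modes, so the midpoint threshold cleanly separates foreground from background; the median filter then removes isolated noise pixels without blurring stroke edges.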
Hence, a two-stage segmentation approach is followed, where zone-level features (reference points) are extracted in the first stage and then, in the second stage, these reference points are used for connected component analysis in segmenting the characters. It is suggested that the key to high performance is the ability to select and utilize the distinctive features of characters. Feature extraction can be defined as the process of extracting distinctive information from the matrices of digitized characters. There are two main categories of features: global and structural. The Malayalam OCR uses a simple global feature of the character, i.e., the complete pixel representation of the character, with normalization to handle characters of multiple sizes. In the Kannada OCR, different structural and topological features such as the presence of the shirorekha, the presence of holes, their number, position and size with respect to character size, the number of connected components, the number of zero crossings, etc. are used in the first stage of classification. The direction code frequency is used in the second stage. The majority of Malayalam and Kannada characters are composed of a base consonant and a vowel modifier and/or consonant conjunct. Hence the recognition of characters is done in two stages. First, the character components are classified; then the proper sequence of character components is combined into a complete character using a lexicon. Row-sum and column-sum based template matching and backpropagation neural networks are used for classification in the Malayalam OCR. In the case of the Kannada OCR, a binary tree classifier is used in the first stage. If a leaf node of the tree contains only one character component, then that is the classification result. However, if it represents a group of character components, then the final classification is done by a nearest neighbor classifier in the second stage. Next, we considered the recognition of text including only numerals.
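The row-sum/column-sum template matching mentioned above can be illustrated with a short sketch. The signature and the nearest-template interface below are assumptions for illustration, not the thesis's exact formulation.

```python
import numpy as np

def rc_signature(char_img):
    """Row-sum and column-sum feature vector of a size-normalized binary
    character image: a simple global feature of the whole pixel matrix."""
    return np.concatenate([char_img.sum(axis=1), char_img.sum(axis=0)])

def classify_template(char_img, templates):
    """Return the label of the nearest template by Euclidean distance
    between signatures. `templates` maps label -> binary template of the
    same shape (hypothetical interface)."""
    sig = rc_signature(char_img)
    return min(templates,
               key=lambda lbl: np.linalg.norm(sig - rc_signature(templates[lbl])))
```

Because the signature collapses each row and column to a single count, matching is cheap compared to pixel-wise comparison, at the cost of losing some within-row/within-column structure.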
First, the automation of postal pin code reading is considered. In the Indian context, this includes script identification along with numeral recognition. Further, we exclusively address handwritten Kannada numeral recognition using different sets of features and classification techniques. For the multi-lingual pin code reader and handwritten Kannada numeral recognition, we have considered only isolated numerals. Hence, the horizontal and vertical projection profile based methods provide the required numeral segmentation. The features extracted in the case of the multi-lingual pin code reader are invariant moments, height deviation of numerals in the pin code, aspect ratio and average gap of the numeral from the surrounding minimal bounding box. Classification is based on thresholding and a nearest neighbor classifier. Following script identification, a simple template matching technique is used for digit identification. An image pyramid, rather than the image itself, is used to speed up the template matching. Different features and classification methods are evaluated exclusively on handwritten Kannada numeral recognition. Different stroke-based features with both supervised and unsupervised classifiers, in single- and multi-stage configurations, were experimented with. The seven-segment representation of English digits is extended to ten segments to represent some unique features of Kannada numerals. This feature is used along with a decision tree classifier in machine-printed numeral recognition. The feature set is used along with other categories of features (water reservoir, horizontal and vertical strokes, and end points), which are complementary in nature, in handwritten Kannada numeral recognition.
The combined feature set is given to a k-means clusterer for classification. However, it is observed that when different types of features are combined into the same feature vector, some large-scale features may dominate the distance, while the other features do not have the same impact on the classification. Instead, separate classifiers can be used to classify based on each visual feature individually, and the final classification can be obtained by combining the separate base classification results. A modified classification result vector (CRV) method is experimented with. Fuzzy k-nearest neighbor is used as the base classifier for each of the feature sets, giving the fuzzy membership values of each numeral class. These fuzzy membership values provide better information for resolving conflicts between the results of different base classifiers. The final classification, based on the combined classification result vector, is done by a k-nearest neighbor classifier in the second stage. The modified CRV based system is further extended into a hybrid architecture that combines both supervised and unsupervised principles of classification, using the fuzzy k-nearest neighbor and fuzzy c-means mechanisms respectively. The hybrid architecture is robust and performs better than either principle alone.

Item Type: Thesis (PhD)
Subjects: Engineering > MIT Manipal > Computer Science and Engineering
Depositing User: MIT Library
Date Deposited: 03 Dec 2014 09:14
Last Modified: 03 Dec 2014 09:14
URI: http://eprints.manipal.edu/id/eprint/141083
