Complexities in Urdu Nastaliq OCR

Abstract

The complexity of machine recognition of the Arabic language due to its cursive
nature is well known. Urdu is a popular language which is written in an
Arabic-based script but uses a special calligraphic style of writing known as
Nastaliq. The calligraphic nature of Nastaliq and other linguistic properties
of Urdu introduce many further complexities which must be kept in mind in the
development of OCR. This paper introduces the complexities and open issues
which are unique to the Urdu language and the Nastaliq style of writing from an
OCR point of view.

1. Introduction

Optical Character Recognition (OCR) is a branch of Pattern Recognition which is
used to recognize text, normally in the form of digitally scanned images or
live text drawn by a user through some digital input device. The Urdu language
possesses some of the properties which are considered most challenging in
character recognition. The most common of them is cursiveness, which it
inherits from Arabic. Recognition of Urdu script is considerably more complex
than recognition of Arabic script and thus requires much more attention and
unconventional thinking. This paper is written specifically to introduce all
the complexities, to the best of my knowledge, which are unique to the
recognition of Urdu Nastaliq script.

Cursiveness is inherent to Urdu: characters are joined with each other when
written and take a new shape. This characteristic, shared with Arabic, makes it
very difficult for the machine to segment each character separately and
recognize it. Not every character in Urdu and Arabic connects with the other
characters, and some connect only from one side. Some of the characters in the
character set are also used as diacritic marks. These include Toy (ط) and
Hamza (ء). Separate diacritics such as zer (ـِ), zabar (ـَ), pesh (ـُ) and
shadd (ـّ) are also used in Urdu as in Arabic, but they are much less common
than in Arabic text. Dots are also very common and significant. In Urdu a
character may contain up to three dots above, below or inside it. 17 out of 38
characters in Urdu have dots: 10 have 1 dot, 2 have 2 dots and 5 have 3 dots.
Characters in Urdu may also overlap each other vertically.

Urdu is written in the Nastaliq style, unlike Arabic/Persian, which are written
in the Naskh style. Nastaliq is a calligraphic style known for its beauty which
originated by combining two styles, Naskh and Taliq. A less elaborate version
of the style is used for printed Urdu. The credit for computerizing Nastaliq
goes to Mirza Ahmed Jameel, who created 20,000 Nastaliq ligatures in 1980,
ready to be used in computers for printing. He called it Noori Nastaliq. Many
people followed and created their own Nastaliq style fonts, among which Jameel
Noori Nastaliq, Alvi Nastaliq and Faiz Lahori Nastaliq are popular. All the
Nastaliq fonts fulfill the basic characteristics of the Nastaliq writing style.

Urdu Optical Character Recognition can be divided into two major
subcategories:

i) Online

ii) Offline

Offline recognition means attempting to recognize text which is already present
in the form of printed or handwritten material. Thus offline recognition can be
further divided into two categories:

i) Printed

ii) Handwritten

Online recognition refers to real-time recognition as the user moves the pen to
write something. Thus online recognition only involves handwritten text. Online
recognition is considered less complex than offline recognition because the
temporal information of the pen traces is available in online recognition,
which is not the case in offline recognition. Most of the people who have
worked on Urdu character recognition only attempted to recognize isolated
characters.
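
The difference in available information can be illustrated with a minimal
Python sketch. It assumes a hypothetical representation in which an online
stroke is a list of (x, y, t) pen samples and an offline input is a binary
image; the function name rasterize and the sample data are illustrative only.
Two strokes drawn in opposite directions produce the same image, so the
writing order is lost once only the image is available.

```python
def rasterize(strokes, width, height):
    """Render online pen strokes (lists of (x, y, t) samples) into a binary image."""
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for x, y, _t in stroke:
            # The timestamp is discarded here: an offline image keeps only ink.
            image[y][x] = 1
    return image


# The same horizontal stroke, drawn left to right and then right to left.
left_to_right = [(0, 0, 0.0), (1, 0, 0.1), (2, 0, 0.2)]
right_to_left = [(2, 0, 0.0), (1, 0, 0.1), (0, 0, 0.2)]

same_image = rasterize([left_to_right], 3, 1) == rasterize([right_to_left], 3, 1)
```

Here same_image is true although the two pen traces differ, which is the
information an offline recognizer has to do without.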

Two major approaches for recognition of complete Urdu text are found in the
literature:

i) Segmentation based

ii) Segmentation free

Related Work

S. Malik and S.A. Khan used “a rule based slant analysis and conversion” for
online Urdu handwriting recognition. Their system is able to recognize isolated
Urdu characters, numbers, and 200 two-character Urdu words, with a recognition
rate of 93% for isolated characters and numbers and 78% for two-character
words. S.A. Hussain et al. used a segmentation free approach with 20 different
structural features for recognition of 850 single-character, two-character and
three-character ligatures, enabling recognition of 18000 common words of the
Urdu dictionary. They used a BPNN (Back Propagation Neural Network) as the
classifier, with an accuracy of 93% for base ligatures and 98% for secondary
ligatures. M.I. Razzak and S.A. Hussain presented a segmentation free approach
for recognition of online Urdu text using a hybrid classifier of HMM and Fuzzy
Logic. The authors report recognition rates of 87.6% and 74.1% for the Nastaliq
and Naskh styles respectively. K.U. Khan and I. Haider applied various
classifiers, such as a correlation based classifier, a back propagation neural
network classifier and a probabilistic neural network based classifier, to
isolated online handwritten Urdu characters and found that the probabilistic
neural network based classifier works best. A database of 110 instances of
handwritten Urdu characters from 40 individuals of different age groups was
used, and recognition rates of 94% to 98% were reported for 4 different groups
of the Urdu character set, classified on the basis of number of strokes. M.I.
Razzak et al. applied combined online and offline preprocessing techniques to
Urdu text to improve the efficiency of the Urdu character recognition process.
Z. Ahmed and J.K. Orakzai used a feed forward neural network for recognition of
offline Urdu text. The size of the input text was kept constant and the text
was assumed to be diacritic free. They report a recognition rate of 93.4%. T.
Nawaz et al. applied a pattern matching technique on the chain code for the
recognition of isolated Urdu characters in Naskh style. They report a
recognition rate of 89%. I. Shamsher et al. also used a feed forward neural
network for recognition of isolated Urdu characters. They report an accuracy of
98.3%. S.A. Hussain et al. used a Kohonen Self Organizing Map (KSOM) for
pre-segmented Urdu characters in Naskh style. Their system can handle 104
segmented character ligatures with 80% accuracy. S. Sardar and A. Wahab used
the K Nearest Neighbor (KNN) algorithm for isolated online and offline Urdu
characters using 5 features. They report a recognition rate of 97.12%.

 

Material and methods

In this section we will take a close look at complexities
involved in the recognition process.

Number of Character Shapes

In Arabic each letter can have different shapes depending on its position, i.e.
initial, middle or ending. Some letters join with other letters from both
sides, some join from only one side and some do not join at all. Each connected
piece of characters is known as a ligature or sub word. Thus a word can consist
of one or more sub words. In Urdu the shape of a character depends not only on
its position but also on the character to which it is being joined. The
characters change their shape in accordance with the neighboring characters.
This feature of Nastaliq is known as context sensitivity. Thus in Urdu the
possible shapes of a single character are not limited to 3; a character can
have many more shapes depending on the preceding and following characters.
Among the character classes, hamza (ء) does not join from either side and has
only one primary shape, while all other characters connect from either the
right or both sides.

[Figure: Different shapes of the character bay (ب) when joined with characters
from different classes at different positions.]
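
The division of a word into ligatures follows from the joining behaviour of its
letters. The following is a minimal Python sketch, assuming a hand-listed set
of Urdu letters that never connect to the letter after them; the names
NON_FORWARD_JOINING and split_ligatures are illustrative, and the set is an
assumption that may be incomplete. It models only where ligatures break, not
the context-sensitive shapes inside them.

```python
# Letters assumed never to connect to the following letter, so each one
# closes the current ligature: alif, daal, ddaal, zaal, ray, rray, zay,
# zhay, wao, bari yay and hamza (written as code points to avoid ambiguity).
NON_FORWARD_JOINING = set(
    "\u0627\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2\u0621"
)
HAMZA = "\u0621"


def split_ligatures(word):
    """Split a word into ligatures (sub words) using joining rules alone."""
    ligatures, current = [], ""
    for ch in word:
        if ch == HAMZA and current:
            # Hamza joins on neither side: close whatever precedes it.
            ligatures.append(current)
            current = ""
        current += ch
        if ch in NON_FORWARD_JOINING:
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


# The word "Pakistan": pe, alif, kaaf, seen, tay, alif, noon.
pakistan = "\u067e\u0627\u06a9\u0633\u062a\u0627\u0646"
parts = split_ligatures(pakistan)
```

Under these assumptions the word breaks after each alif, giving three ligatures
of two, four and one letters. Each of those ligatures still has to be
recognized as a whole or segmented further, which is where context sensitivity
makes Nastaliq hard.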

 

Sloping

The calligraphic nature of Nastaliq also introduces sloping in the text.
Sloping means that as new letters are joined with previous letters, a slope is
introduced in the text because the letters are written diagonally from top
right to bottom left. One of the major advantages of sloping is that it
conserves a lot of writing space.

Sloping also means that characters no longer join with each other on the
baseline, which is an important property of Naskh. This property is utilized in
the character segmentation algorithms for Arabic/Persian text, so the character
segmentation algorithms designed for Arabic/Persian text cannot be applied to
Urdu text. The number of character shapes and sloping make Nastaliq character
segmentation the most challenging task in the whole recognition process, and to
our knowledge no algorithm yet exists which promises decent results in
segmentation of sub words into individual characters. This is also one of the
main hurdles which keep most researchers away from accepting the challenge of
Nastaliq character recognition.
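
The loss of a shared baseline can be illustrated with a horizontal projection
profile, a common way of locating a baseline in a binary text image. The
following minimal Python sketch uses synthetic strokes rather than real
script, and the function names are illustrative. It measures what fraction of
the ink lies on the single densest row.

```python
def horizontal_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def baseline_peak_share(image):
    """Fraction of all ink that lies on the single densest row."""
    profile = horizontal_profile(image)
    total = sum(profile)
    return max(profile) / total if total else 0.0


def flat_stroke(width, height, row):
    """Synthetic stand-in for Naskh: all joins lie along one horizontal row."""
    return [[1 if r == row else 0 for _ in range(width)] for r in range(height)]


def sloped_stroke(size):
    """Synthetic stand-in for Nastaliq: ink runs from top right to bottom left."""
    return [[1 if c == size - 1 - r else 0 for c in range(size)]
            for r in range(size)]


flat_share = baseline_peak_share(flat_stroke(8, 8, 5))
sloped_share = baseline_peak_share(sloped_stroke(8))
```

In this toy case the flat stroke puts all of its ink on one row, while the
sloped stroke spreads it evenly over every row, so the profile has no peak to
serve as a baseline. Real text is far less extreme, but the effect is the
reason baseline-based segmentation does not carry over.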

Stretching

Another very important property of the Nastaliq style is stretching. Stretching
means that letters are replaced with longer versions instead of their standard
versions. Some characters even change their default shape when stretched, e.g.
seen (س), while others only change their width. The purpose of stretching is
not only to bring more beauty to the character; it also serves as a tool for
justification. Justification means that the text meets the boundaries of the
bounded area irrespective of the varying length of the sentences. However, it
should be noted that not every character in Urdu can be stretched. For example
alif (ا), ray (ر) and daal (د) cannot be stretched, but bay (ب), seen (س) and
fay (ف) can be stretched. It should also be noted that stretching works closely
with the context sensitive property of Nastaliq: a certain class of characters
can only be stretched when joined with a character of a certain other class or
written at a certain position (initial, medial or end). All these attributes
show that stretching is a complex procedure, and it also increases the
complexity of machine recognition. Standard Nastaliq fonts used in print
normally do not support stretching. However, it is commonly used in book titles
and calligraphic art. So if we are dealing only with machine printed Nastaliq
text, we normally do not need to worry about stretching, but if we are dealing
with a calligraphic or handwritten Nastaliq document, it is very likely that we
will have to deal with stretched versions of characters.

Positioning and Spacing

Like stretching, positioning and spacing are important tools for justification
in Nastaliq and are also used for the beautification of text. Positioning means
the placement of ligatures and sub words in Nastaliq, and spacing means the
space between two consecutive ligatures. In normal situations a ligature is
written in sequence after the previous ligature with a small standard spacing.
But positioning allows ligatures to be placed at different positions: a new
ligature may start somewhere from the top of the previous ligature, or it can
be placed right above it even if it is part of another word. Positioning may
even overlap and connect two ligatures if the need arises. Unlike stretching,
positioning is quite common and is used extensively in news headings in the
Urdu print media industry because of its power to accommodate long and big
headings in small spaces on the page. All these flexibilities and strengths of
Nastaliq make it a real challenge for machine recognition. On one hand, context
sensitivity and sloping make character segmentation a very difficult task, and
on the other hand, positioning makes even ligature and sub word segmentation
equally difficult.
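
The effect of positioning on ligature segmentation can be sketched with
connected component labelling. The following minimal Python example assumes a
binary image given as a list of rows; the function names and the toy image are
illustrative only. It labels components with 8-connectivity and checks whether
two of them share columns, in which case no vertical cut can separate them.

```python
from collections import deque


def connected_components(image):
    """Return components as lists of (row, col) ink pixels, 8-connectivity."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def column_span(pixels):
    """Leftmost and rightmost column occupied by a component."""
    xs = [x for _, x in pixels]
    return min(xs), max(xs)


def spans_overlap(a, b):
    """True if two (min_col, max_col) spans share at least one column."""
    return a[0] <= b[1] and b[0] <= a[1]


# Toy image: one ligature placed above another, separated by a blank row.
image = [
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
components = connected_components(image)
overlap = spans_overlap(column_span(components[0]), column_span(components[1]))
```

The two components are found separately, but their column spans overlap, so a
column-wise split would cut through both. If the two ligatures touched, they
would merge into a single component, which corresponds to the case where
positioning connects two ligatures.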

Complex Dot Placement Rules

In Urdu a character can have up to three dots placed above, below or inside it.
However, sloping and context sensitivity can alter the rules for the standard
positions of dots. In many situations, due to sloping and context sensitivity,
there is not enough space for the dots to be placed at their standard position,
such as inside or right below the character. In that case the dots are moved
from their standard position to some other position nearby. The characters
whose presence can influence the standard dot placement rules are bari yay (ے),
jeem (ج), chey (چ), hey (ح), khey (خ), fey (ف), qeaf (?), ain (ع) and qaaf (?).
The simpler Naskh style does not face this issue, and dots are always found at
specific locations for each character. In Nastaliq, however, the situation is
more complex, and it is more difficult to associate dots with the correct
primary component.
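
The association problem can be made concrete with a naive baseline. The
following minimal Python sketch assigns each dot to the primary component with
the nearest centroid. This heuristic is an assumption introduced here for
illustration, not a method proposed in this paper, and the component
coordinates are made up.

```python
import math


def centroid(pixels):
    """Mean (row, col) position of a component's pixels."""
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)


def associate_dots(primaries, dots):
    """Map each dot index to the index of the primary with the nearest centroid."""
    centers = [centroid(p) for p in primaries]
    assignment = {}
    for i, dot in enumerate(dots):
        cy, cx = centroid(dot)
        assignment[i] = min(
            range(len(centers)),
            key=lambda k: math.hypot(cy - centers[k][0], cx - centers[k][1]),
        )
    return assignment


# Two primary components on the same row, centred at columns 1 and 11.
primaries = [[(5, 0), (5, 1), (5, 2)], [(5, 10), (5, 11), (5, 12)]]

# Dots at their standard positions, directly below or above their primary.
standard = associate_dots(primaries, [[(7, 1)], [(3, 11)]])

# A dot of the first primary pushed sideways for lack of space.
displaced = associate_dots(primaries, [[(7, 7)]])
```

With dots at standard positions the heuristic assigns them as expected, but the
displaced dot lands closer to the second primary and is assigned to it. This is
the kind of error that displaced dots in Nastaliq make likely, and it suggests
that distance alone is not a sufficient rule.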

Conclusion

 

The standard style for writing Urdu, Nastaliq, is inherently complex for
machine recognition due to its calligraphic nature. The challenge of Urdu
character recognition differs from that of Arabic character recognition because
of these complexities. Various issues need to be resolved for Nastaliq
character recognition, among which the more important are context sensitivity,
sloping, positioning, overlapping, filled loops and false loops. All the issues
presented in this paper are yet to be resolved and thus require special
attention. We believe that these issues are complex and need to be considered
individually by researchers. Once they are solved, a robust solution to Urdu
Nastaliq OCR will follow. This paper can therefore be taken as a road map to
the solution of the Urdu Nastaliq OCR problem.
