Indian language Search Engine Technologies - Problems and Solutions

  • Indian language Search - Overview
  • Indian Music Search Engine
  • Search Engine PLUG-INS
  • Problems with existing UNICODE based engines
  • GClass : Gist Cross Language Search Plug-ins Suite [for Indian language Search Engine]
  • Semantic Web - Indian language overview

Indian language Search - Overview

C-DAC GIST has developed Indian language search engine and data mining technologies. Search engines, database vendors and e-governance implementers can use these tools to offer better Indian language results and to support data mining. This page lists the problems users face when searching Indian language content on Web and desktop search engines, as well as in Indian language databases.

The Web is constantly evolving, from the basic Web to WWW2 and today to WWW3, which is mainly focused on man-machine interaction. Fast, accurate and user-friendly information retrieval is the watchword of WWW3, and here Natural Language Interfaces play a crucial role. The GIST R&D Labs are working in two focal areas: Indian language plug-ins for search engines and the Semantic Web.

Indian Music Search Engine

What is Music Search :

Music Search is a web-based application that simulates a search engine in the music domain.

Searches fall into four categories:

  • Search Actor / Actress or Hero / Heroine
  • Search Movie by Name
  • Search Singer
  • Search Song Title

The search matches the entered word exactly and also finds similar-sounding words and misspellings. The Music Search application is powered by the GIST Homophone engine, which returns exact matches of the word together with words of similar spelling and words of similar pronunciation.

For a free demo contact us : info.gist@cdac.in

How is Music Search in Indian language different from others :

When written in English, Indian words have many spelling variations, and it is not easy for a user to type the correct spelling of some complex Hindi words when searching for them. This search engine therefore ensures that the output includes every word with the exact spelling entered and also words with similar spellings. For example, if the user enters 'Abhijjet', the search also returns results for Abhijeet and Abhijit.

If the user enters 'Bombay', the GIST Music Search Engine output includes Bambai, Bombai, Mumbai and Bombay.
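The matching of Roman-script spelling variants described above can be illustrated with a crude phonetic key. This is only a sketch under stated assumptions: the function names `phonetic_key` and `similar` are hypothetical, the rule (collapse repeated letters, drop non-initial vowels) is a generic simplification, and it is not the GIST Homophone engine.

```python
import re

VOWELS = "aeiouy"  # 'y' is treated as a vowel so that "Bombay" and "Bombai" agree


def phonetic_key(word: str) -> str:
    """Return a rough key: lowercase, collapse repeated letters, drop non-initial vowels."""
    w = re.sub(r"[^a-z]", "", word.lower())
    w = re.sub(r"(.)\1+", r"\1", w)  # "abhijjet" -> "abhijet"
    head, tail = w[:1], w[1:]
    return head + "".join(ch for ch in tail if ch not in VOWELS)


def similar(query: str, candidates: list[str]) -> list[str]:
    """Return the candidates whose key equals the key of the query."""
    key = phonetic_key(query)
    return [c for c in candidates if phonetic_key(c) == key]


names = ["Abhijeet", "Abhijit", "Arijit"]
print(similar("Abhijjet", names))
```

A key of this kind groups Bombay, Bambai and Bombai, but not Mumbai, which differs in its first consonant; a pairing such as Bombay / Mumbai would need an explicit alias table in addition to the phonetic rule.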

Usability :

This search engine has many applications in the music domain, wherever a user searches by artist name, movie name, song title or singer name and may not be able to spell the name correctly in English.

The GIST Music Search Engine can be used as a plug-in component for music sites. By enabling this kind of intensive search over the site's existing database, it makes the site more user friendly.


Search Engine PLUG-INS

Search engines are mainly statistical in nature: their "heuristics" rest on statistical prediction, and the ranking algorithm very often does not satisfy the user.

However useful they are, there are serious problems associated with their use:

  • Information over-kill without precision. Too much is as bad as too little content. If the user has to go through 12,000 pages to get relevant information, (s)he has neither the time nor the energy to go through the pages.
  • Low or zero information. The converse is equally possible, although rarer.
  • Sensitivity to wording. A change in wording brings up different pages, as very often does a change in spelling or the use of spelling variants.
  • Monolithic results. If information is needed about various pieces of data, separate queries have to be initiated and then combined to meet the requirement. If the user wants to book a ticket on a train or a plane, only multiple queries will meet the requirement.

In other words, Web content outstrips Web retrieval technology. Search engines are therefore at best only "addresses" on the Information highway, and to call them "Information Retrieval Tools" is a misnomer.

In the case of Indian Languages such as Hindi, the problem is even more acute:

  • Spelling Variants
  • Widely used Incorrect spellings
  • Lack of correspondence between script and language
  • Intra-word Grammars
  • Complex inflectional or agglutinating nature of Indian languages
  • Legacy data
  • Multilingual data in Indian languages
  • Natural Query

Thanks to the sophisticated linguistic search tools of C-DAC GIST, the Indian language plug-ins allow such searches to be carried out.

Problems with existing Indian language Search engines and Databases

Engines such as Yahoo and Google provide Indian language support, and databases such as ORACLE, MS-SQL and MYSQL also support Indian languages. This support is a UNICODE based search, which is not adequate for Indian languages. Yahoo and Google search fall short for Indian languages for the following reasons:

Problem 1

Multiple correct spellings

Many words in Indian languages have multiple correct spellings and alternate representation forms. For example, the word Hindi itself may be written with a bindi on top of the first syllable or with a half na.

What should happen when such words are searched for in search engines?

Another example of this problem with search engines is the multiple representations of the word vitthal.

Problem 2

2.1 Many languages, one script.

Consider a page in Devanagari encoding. Devanagari supports 54 languages / dialects, of which the main ones are Hindi, Marathi, Konkani, Nepali and Dogri. Similarly, the Bengali and Urdu code pages are used by many languages of the world. The search engine must identify the language correctly to avoid giving wrong results.

A Marathi page should not be returned for a Hindi query, nor a Bangla page for an Assamese search, nor an Arabic web page for an Urdu search request.
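One simple way to tell apart two languages that share a script is to count common function words. The sketch below is illustrative only: the marker lists are a small hand-picked sample, the name `guess_language` is hypothetical, and a production language detector would rely on much richer statistics.

```python
# Small, hand-picked samples of very common function words (assumed for illustration).
HINDI_MARKERS = {"है", "और", "में", "का", "की", "नहीं"}
MARATHI_MARKERS = {"आहे", "आणि", "मध्ये", "चा", "ची", "नाही"}


def guess_language(text: str) -> str:
    """Guess 'hindi' or 'marathi' for Devanagari text, or 'unknown' on a tie or no evidence."""
    tokens = text.replace("।", " ").split()
    hindi = sum(tok in HINDI_MARKERS for tok in tokens)
    marathi = sum(tok in MARATHI_MARKERS for tok in tokens)
    if hindi == marathi:
        return "unknown"
    return "hindi" if hindi > marathi else "marathi"


print(guess_language("यह किताब अच्छी है और सस्ती है"))
print(guess_language("हे पुस्तक चांगले आहे आणि स्वस्त आहे"))
```

Short queries often contain no function words at all, so an approach like this is better suited to classifying crawled pages than to classifying the query itself.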

2.2 One language, many scripts: how to search for a word in Konkani.

Konkani may have web pages in Devanagari, Roman, Kannada and Malayalam. Similarly, Sindhi may have web pages in Devanagari, Gujarati, Roman and Perso-Arabic. A user may know more than one of these scripts and will appreciate results in each of them.
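For the Brahmi-derived scripts mentioned above, a query can be carried from one script to another by exploiting the fact that their Unicode blocks follow a common ISCII-derived layout, so the same letter mostly sits at the same offset in each block. The following is a minimal sketch under that assumption; the name `transliterate` is hypothetical, and Roman or Perso-Arabic would need explicit mapping tables instead.

```python
import unicodedata

# First code point of each 128-character Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "gujarati": 0x0A80,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def transliterate(text: str, source: str, target: str) -> str:
    """Shift each character of the source block to the same offset in the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + 0x80:
            mapped = chr(cp - src + dst)
            # Keep the original character if the target slot is unassigned.
            out.append(mapped if unicodedata.name(mapped, "") else ch)
        else:
            out.append(ch)
    return "".join(out)


print(transliterate("कमल", "devanagari", "gujarati"))
```

The offset trick is only a first approximation: some letters exist in one script and not in another, and spelling conventions differ between scripts, so real cross-script search still needs language-specific rules.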

Problem 3

Direct string match or SQL Select queries do not work for Indian languages.

For Indian languages, a 'UNICODE only' search engine is not sufficient.

3.1 Multiple encodings of data are not supported: Websites with Indian language content (including e-gov applications, which form a major chunk) may be in UNICODE / UTF-8 or in a variety of proprietary hack font encodings from different vendors, which may or may not be suitable for the web. Identifying that a page is in Hindi is therefore difficult based solely on the UNICODE code page, and language identification (of the page encoding and its content) is even more difficult. Indexing, classifying and storing the web page becomes a major offline task during crawling.
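The limitation in 3.1 can be seen with a simple script profile that classifies characters by Unicode range. This is a sketch with hypothetical names (`script_of`, `dominant_script`); the second sample string is an invented stand-in for font-encoded text, not the output of any particular vendor font.

```python
from collections import Counter

# Inclusive Unicode ranges for a few scripts.
SCRIPT_RANGES = {
    "arabic": (0x0600, 0x06FF),
    "devanagari": (0x0900, 0x097F),
    "bengali": (0x0980, 0x09FF),
}


def script_of(ch: str):
    """Return the script name for one character, or None if it is not classified."""
    if ch.isascii() and ch.isalpha():
        return "latin"
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return None


def dominant_script(text: str):
    """Return the most frequent script in the text, or None if nothing was classified."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("हिन्दी भाषा"))    # UNICODE page
print(dominant_script("ihndI BaaYaa"))   # invented stand-in for font-encoded text
```

Text stored in a legacy font encoding is made of Latin-range code points that only look like Devanagari when the right font is applied, so a code-page check alone reports it as Latin; and even a correct "devanagari" answer does not say whether the page is Hindi, Marathi or another language.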

3.2 In UNICODE, Indian language data requires normalisation (eg: ja+nukta, as in Reserve Bank). Data present on a web page in one form may therefore not match the string that the user has entered in the search box.

Most applications, including popular databases such as ORACLE and MS-SQL, do not support the 'like' query for Indian languages in the Select statement of SQL.

3.3 A large part of India's web user base accesses the net over low bandwidth from Windows 9x systems, or from XP systems without the Indian language pack. GIST tools support typing Indian languages in UNICODE in browsers on such systems, even without the Indian language pack.
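The ja+nukta case in 3.2 can be reproduced with Python's standard `unicodedata` module. The letter za has a single precomposed code point (U+095B) and an equivalent two-code-point sequence (U+091C followed by U+093C); a raw string comparison treats them as different. The helper name `same_text` is hypothetical.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after Unicode canonical normalisation (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u095b"       # DEVANAGARI LETTER ZA as one code point
sequence = "\u091c\u093c"    # DEVANAGARI LETTER JA followed by NUKTA

word_on_page = "रि" + precomposed + "र्व"   # one possible encoding of the word
word_in_query = "रि" + sequence + "र्व"     # the other encoding of the same word

print(word_on_page == word_in_query)           # raw comparison
print(same_text(word_on_page, word_in_query))  # comparison after normalisation
```

Normalising both the indexed text and the query to the same form before comparison removes this class of mismatch; it does not address spelling variants, which are a separate problem.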

Problem 4

Typographical variants for Indian language search.

Most search engines offer a 'did you mean' option based on statistics. On the web, however, incorrect spellings of Indian language words are often used more frequently than the correct ones, so search engines that rely on statistics or on crowd-sourcing fail.

Problem 5

Language variants in Indian language search.

Due to the complex nature of Indian languages, a user should be given results that include the linguistic variants of the searched terms, including suffixed forms.

Problem 6

Owing to historical reasons, Indian languages are rich in synonyms, with a single word having several. These are used interchangeably in web content and blogs.

Synonyms

- Similar meaning (eg: Bharat, India)

- Colloquial terms

- Old terms (eg: a search for Madras giving Chennai results)
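Synonym handling of this kind is commonly implemented as query expansion against a table of equivalent terms. The sketch below uses only the two example pairs given above; the table and the name `expand_query` are hypothetical, and a real system would draw on full dictionaries.

```python
# Each set groups terms that should retrieve one another (sample data only).
SYNONYM_GROUPS = [
    {"bharat", "india"},
    {"madras", "chennai"},
]


def expand_query(term: str) -> list[str]:
    """Return the term together with its known synonyms, sorted for stable output."""
    key = term.lower()
    for group in SYNONYM_GROUPS:
        if key in group:
            return sorted(group)
    return [key]


print(expand_query("Madras"))
print(expand_query("Delhi"))
```

The expanded list can then be sent to the underlying engine as an OR query, so that pages using either the old or the new name are returned.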

Solution to Indian language search and database problems -

G-CLASS : GIST Cross Language Search Plug-ins Suite [for Indian language Search Engine]

Why Indian language search plug-ins ?

GClass stands for GIST Cross Language Search Plug-ins Suite. Indian languages are unique in their structure and quite complex in nature. Because of this intrinsic difficulty, standard search methodology is inappropriate for Indian languages, and Web, desktop or database search for them demands special tools. GClass provides a suite of plug-ins that deal with exactly these difficulties. Contact us : info.gist@cdac.in

Plug-Ins available with GClass :

1. Alternate Spelling for Indian languages :

Indian languages abound in alternate spellings. For instance, there are two ways to spell the word Hindi, viz. “हिंदी” and “हिन्दी”. A search on one form should return results with the other form as well. This plug-in uses a rule-governed homophonic engine to provide results in this manner. When the entered query has spelling variants, it offers the other variants as suggestions, allowing the user to search for a specific one or for all.
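One rule of the kind such an engine might apply is to map a half nasal consonant before another consonant to the anusvara, so that both spellings of Hindi share one key. This is a single illustrative rule with a hypothetical name (`spelling_key`), not the GIST rule set, and it deliberately over-generalises: it ignores whether the nasal actually belongs to the same class as the consonant that follows.

```python
import re

VIRAMA = "\u094d"     # halant, marks a half consonant
ANUSVARA = "\u0902"   # the dot written above the syllable

# Nasal consonants (nga, nya, nna, na, ma) + virama, followed by any consonant ka..ha.
NASAL_CLUSTER = re.compile(
    "[\u0919\u091e\u0923\u0928\u092e]" + VIRAMA + "(?=[\u0915-\u0939])"
)


def spelling_key(word: str) -> str:
    """Rewrite half-nasal clusters as anusvara so that variant spellings compare equal."""
    return NASAL_CLUSTER.sub(ANUSVARA, word)


print(spelling_key("हिन्दी") == spelling_key("हिंदी"))
```

Indexing documents under `spelling_key(word)` and applying the same function to the query lets either spelling retrieve both forms.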

2. Mis-spelling :

Like all languages, Indian languages have their share of misspelled words. In Indian languages, some misspelled words are in more prominent use than their correct counterparts. For example, the word “जांच” is incorrect but is used more often than its correctly spelled form “जाँच”. G-CLASS allows web and desktop search results to include pages that contain the misspelled form of the entered word as well.

It also allows the search to be filtered through a spell checker, which suggests the correct spelling to the user.
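The जांच / जाँच example differs only in the nasal sign used (anusvara versus chandrabindu), so a query can be expanded to cover both forms regardless of which one is more frequent. The sketch below is an assumption-laden illustration with hypothetical names (`nasal_variants`, `search`); it handles this one confusion only.

```python
CHANDRABINDU = "\u0901"
ANUSVARA = "\u0902"


def nasal_variants(word: str) -> set[str]:
    """Return the word plus its forms with chandrabindu and anusvara swapped."""
    return {
        word,
        word.replace(CHANDRABINDU, ANUSVARA),
        word.replace(ANUSVARA, CHANDRABINDU),
    }


def search(query: str, documents: list[str]) -> list[str]:
    """Return the documents that contain the query in any of its nasal variants."""
    wanted = nasal_variants(query)
    return [doc for doc in documents if any(v in doc for v in wanted)]


docs = ["जांच जारी है", "जाँच पूरी हुई", "कोई खबर नहीं"]
print(search("जाँच", docs))
```

Because the expansion is rule-based rather than frequency-based, the correctly spelled query still finds pages that use the more popular misspelling, and vice versa.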

3. Synonyms:

Because of their historical antecedents, Indian languages are rich in vocabulary, with more than one synonym for a particular word. To ensure that synonyms are trapped in the search net, the synonym suite provides the most common synonymic equivalents of the word, thereby enriching the search capabilities. For example, a search for भाषा also brings up the most common synonyms for language.

4. Multi Lingual Lookup:

This plug-in allows the user to enter a query in English and get the search results in the desired language. It is a boon for people who know English well but do not know the Hindi equivalent of the search word. Another form of the plug-in transliterates search results on the fly from any language to any other language, enabling the user to get the results in the desired language.

5. Lemmatiser for Indian languages:

Intra-word grammar is one of the major attributes of Indian languages, so the user should be given results that include the linguistic variants of the searched terms, including suffixed forms such as “चुने”, “चुनकर” and “चुनिये”.
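The three forms quoted above can be reduced to a common stem by stripping suffixes, longest first. This is a toy suffix stripper built only from those three examples; the suffix list and the name `stem` are assumptions for illustration, and a real lemmatiser needs a full morphological rule set and dictionary.

```python
# Suffixes taken from the three example forms only (sample data).
SUFFIXES = sorted(
    [
        "\u0915\u0930",        # "kar"
        "\u093f\u092f\u0947",  # "iye"
        "\u0947",              # vowel sign "e"
    ],
    key=len,
    reverse=True,              # try the longest suffix first
)


def stem(word: str, min_stem: int = 2) -> str:
    """Strip the longest matching suffix, keeping at least min_stem code points."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for form in ["चुने", "चुनकर", "चुनिये"]:
    print(form, "->", stem(form))
```

Indexing documents by stem and stemming the query in the same way lets a search for any one inflected form retrieve the others.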

6. Natural Query System :

Search engines should provide a natural query system that allows the user to query the web and get an answer to his/her query. Instead of returning a million answers to a simple query such as the price of gold today, our Natural Query System plug-in, coupled with the spell checker and the cross lingual module, provides a correct answer to the query.

The plug-ins are currently available in Hindi, Marathi, Gujarati, Bangla, Malayalam and Urdu; other languages are targeted. They are equipped with high quality components such as a language detector (for distinguishing languages that use the same script, eg: Hindi and Marathi), a homophone engine, a spell checker, a lemmatiser and dictionaries.

Semantic Web - Indian language overview

The aim of the Semantic Web is to allow much more advanced knowledge management systems such that:

  • Knowledge will be organized in conceptual spaces according to its meaning.
  • Automated tools will support maintenance by checking for inconsistencies and extracting new knowledge.
  • Keyword-based search will be replaced by query answering: requested knowledge will be retrieved, extracted, and presented in a human friendly way.
  • Query answering over several documents will be supported.
  • Defining who may view certain parts of information (even parts of documents) will be possible.

The search engine of tomorrow will be Semantic Web-compliant, and GIST Labs are already working in this area to develop a Semantic Web for Indian languages, which will address all major issues that are pertinent to Indian scripts, such as tomography, ontology and the creation of agents, within the framework of true information retrieval.

Contact information for this page : sales.gist@cdac.in

Software Guide : Assamese, Bengali, Bodo, Dogri, Hindi, Gujarati, Kashmiri, Kannada, Konkani, Maithli, Marathi, Malayalam, Manipuri, Nepali, Oriya, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu

© 2012 C-DAC. All rights reserved. Last Updated: Monday, April 6, 2012