REFLEX: Named Entity Recognition and Transliteration for 50 Languages

This is the project page for National Security Agency contract NBCHC040176, 09/30/2004-09/29/2006 under the auspices of REFLEX (Research on English and Foreign Language Exploitation).

The PIs on this project are:

Research Assistants assigned to this project are:

Original Proposal

Summary

The research on methods for Named Entity Recognition (NER) is voluminous but has tended to focus on the problem in widely used languages such as English, other Western European languages, Arabic, and Asian languages such as Chinese, Japanese and Korean.

The purpose of this project is to fill the gap by providing resources and tools that will allow one to rapidly build named entity detectors for a collection of 50 languages in which we have expertise, nearly all with speaker populations numbering in the millions. This includes nearly all languages that fall into the category of "Less Commonly Taught Languages" (LCTLs). We will focus on the recognition of named entities falling into the categories of PERSON, ORGANIZATION and LOCATION.

Why do we believe that it is feasible to provide resources for so many languages? Research in the last few years has shown that machine learning approaches can learn to recognize and classify named entities reliably. Recognizing named entities requires:

  1. Recognizing the boundaries of entity phrases, and
  2. Classifying the target phrases into one of several entity types (or a miscellaneous type).

The key to our technical approach is the observation that semi-supervised learning methods can be used for both stages. With a lexicon of entities in a given language and language-specific features for what constitutes an entity as input, we will develop methods to bootstrap a named entity detector for new languages. For example, as demonstrated in (Collins and Singer, 1999), the second stage of this process can be solved reliably with a small number of initial examples.

We propose a two-year project. The first part of the project, accomplished in Year 1, will involve collecting plausible initial rule sets for the 50 languages listed below. These will include the widely spoken languages that have already received significant attention for NER (which we will include as a litmus test for our methods), as well as many other languages to which no attention has been given. For each language, we will collect (from dictionaries, grammars or online resources, and with native speaker expertise) the following kinds of resources:

  1. Unambiguous personal titles (e.g. English Mr.)
  2. Unambiguous organization titles (e.g. Corporation, Incorporated)
  3. Unambiguous place names.
  4. Language-particular rules for titles that determine on which side of the title the name occurs. (E.g. Mr. occurs on the left of the name in English, but xiansheng 先生 occurs on the right of the name in Mandarin.)
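
As a rough illustration of how such seed resources might be encoded and applied, the sketch below stores each title together with the side on which it sits relative to the name, then pulls out the adjacent tokens as candidate names. The `TitleRule` class, the `candidate_names` function, and the fixed `max_len` window are hypothetical names and simplifications introduced here, not part of the proposal; a real system would rely on a learned phrase boundary detector rather than a fixed window.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TitleRule:
    """Hypothetical seed rule: an unambiguous title and where it sits."""
    title: str       # unambiguous title token, e.g. "Mr."
    label: str       # entity type the title signals
    title_side: str  # "left" if the title precedes the name, "right" if it follows


def candidate_names(tokens: List[str], rules: List[TitleRule],
                    max_len: int = 1) -> List[Tuple[str, List[str]]]:
    """Return (label, name_tokens) pairs for tokens adjacent to a seed title.

    Assumption: the name is at most `max_len` tokens long; this stands in
    for a real phrase boundary detector.
    """
    by_title = {rule.title: rule for rule in rules}
    found = []
    for i, token in enumerate(tokens):
        rule = by_title.get(token)
        if rule is None:
            continue
        if rule.title_side == "left":
            # Title on the left, so the name follows it.
            span = tokens[i + 1:i + 1 + max_len]
        else:
            # Title on the right, so the name precedes it.
            span = tokens[max(0, i - max_len):i]
        if span:
            found.append((rule.label, span))
    return found


RULES = [
    TitleRule("Mr.", "PERSON", "left"),
    TitleRule("先生", "PERSON", "right"),
]
print(candidate_names(["Mr.", "Smith", "arrived"], RULES))
print(candidate_names(["王", "先生", "来", "了"], RULES))
```

With these two rules, the English example yields `Smith` and the Mandarin example yields `王` as PERSON candidates. Keeping the side as data rather than code means a new language only needs new rule entries.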

In parallel with the first part of the project, in the second part we will develop machine learning algorithms that will produce high-quality NE detectors from a small set of initial seed rules. We propose to use semi-supervised learning methods, similar to those suggested in (Yarowsky, 1995), (Collins and Singer, 1999) and others. Our approach will develop phrase boundary detectors (Punyakanok and Roth, 2001) and classifiers for entities (Roth and Yih, 2001; Roth and Yih, 2002; Roth and Yih, 2004) and will make use of the SNoW learning architecture (Carlson et al., 1999). It will also use named-entity-specific linguistic features to identify different renditions of the same entity, including abbreviations (Li, Morie, and Roth, 2004b; Li, Morie, and Roth, 2004a), and document-level inference to learn from multiple occurrences of an entity in the same document.
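
To make the idea of bootstrapping from seed rules more concrete, here is a heavily simplified sketch, loosely in the spirit of the two-view bootstrapping described by Collins and Singer (1999) but not a reproduction of their algorithm or of the method proposed here. Each example is assumed to have a spelling feature (the name string) and a context feature (a neighboring word); rules learned on one view label examples that then yield rules on the other view. The function names, the thresholds `min_count` and `min_precision`, and the toy data are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (spelling feature, context feature)


def induce(examples: List[Example], known: Dict[str, str], src: int, dst: int,
           min_count: int = 2, min_precision: float = 0.9) -> Dict[str, str]:
    """Learn rules on view `dst` from examples labeled by rules on view `src`."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for example in examples:
        label = known.get(example[src])
        if label is not None:
            counts[example[dst]][label] += 1
    rules = {}
    for feature, label_counts in counts.items():
        label, n = label_counts.most_common(1)[0]
        # Keep only features that are frequent and nearly unambiguous
        # among the examples labeled so far.
        if n >= min_count and n / sum(label_counts.values()) >= min_precision:
            rules[feature] = label
    return rules


def bootstrap(examples: List[Example], seed_spelling: Dict[str, str],
              rounds: int = 3) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Alternate between the two views, never overwriting earlier rules."""
    spelling = dict(seed_spelling)
    context: Dict[str, str] = {}
    for _ in range(rounds):
        for feature, label in induce(examples, spelling, src=0, dst=1).items():
            context.setdefault(feature, label)
        for feature, label in induce(examples, context, src=1, dst=0).items():
            spelling.setdefault(feature, label)
    return spelling, context


# Toy data: two seed names, two unseen names sharing their contexts.
EXAMPLES = [
    ("Smith", "president"), ("Smith", "president"),
    ("Jones", "president"), ("Jones", "president"),
    ("Chicago", "in"), ("Chicago", "in"),
    ("Urbana", "in"), ("Urbana", "in"),
]
SEEDS = {"Smith": "PERSON", "Chicago": "LOCATION"}
spelling_rules, context_rules = bootstrap(EXAMPLES, SEEDS)
print(spelling_rules)
print(context_rules)
```

On this toy data the seeds label the contexts `president` and `in`, which in turn label the unseen names `Jones` and `Urbana`. Real data would need richer features and more careful confidence estimates than the raw counts used here.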

In a third part of the project, to be addressed in Years 1 and 2, we will also concern ourselves with the problem of identifying transliteration equivalents between arbitrary languages and English. In languages that use the Roman script this will generally not be an issue, but for languages that use other scripts we would like some way to determine that a particular named entity might correspond to a well-known entity name in English. We propose to make use of previous work on automatic transliteration (e.g. Knight and Graehl, 1997), coupled with a document-level model that compares the distribution of names in a given non-English document with the distribution of names in similar documents in English. More details will be given in the Tasks section below.

Finally, in Year 2, we will evaluate our work. We cannot evaluate on 50 languages, but we can take a sampling of languages for which we develop resources in Year 1, and demonstrate the performance of our methods on these. This will require acquiring (unannotated) training corpora and (annotated) testing corpora. We expect to be able to develop these corpora from online sources. Since annotation is required for the testing portion, we will limit ourselves to languages in which we have local expertise.

We propose the following ten languages for evaluation: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua. This list includes both widespread languages, such as Chinese, and LCTLs. For languages like Chinese, we can use corpora that are already used for NER evaluation, as a way of comparing our methods with those of others. This will in turn give us a baseline for comparison with performance on LCTLs, so that we will have some sense, from these ten languages, of the range of difficulty of NER across languages.

Full Proposal

PDF is here (password protected).

Data

Data links have been moved here (password protected).

Background Reading

Automatic Transliteration

Arbabi, Mansur, Scott M. Fischthal, Vincent C. Cheng, and Elizabeth Bart. 1994. Algorithms for Arabic name transliteration. IBM Journal of Research and Development, 38(2):183-193. PDF
AbdulJaleel, Nasreen and Leah S. Larkey. 2002. English to Arabic Transliteration for Information Retrieval: A Statistical Approach. CIIR Technical Reports. PDF
AbdulJaleel, Nasreen and Leah S. Larkey. 2003. Statistical Transliteration for English-Arabic Cross Language Information Retrieval. In CIKM '03. PDF
Al-Onaizan, Y. and K. Knight. 2002. Machine Transliteration of Names in Arabic Text. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages. PDF
Gao, Wei, Kam-Fai Wong, and Wai Lam. 2004. Phoneme-based transliteration of foreign names for OOV problem. In First International Joint Conference on Natural Language Processing, pages 374-381, Sanya, Hainan, China. Asia Federation for Natural Language Processing. PDF
Gao, Wei. 2004. Phoneme-based Statistical Transliteration of Foreign Names for OOV Problem. Master's Thesis, Chinese University of Hong Kong. PDF
Kawtrakul, Asanee, Amarin Deemagarn, Chalatip Thumkanon, Navapat Khantonthong, and Paul McFetridge. 1998. Backward Transliteration for Thai Document Retrieval. In Proceedings of the 1998 IEEE Asia-Pacific Conference on Circuits and Systems, pages 563-566. PDF
Knight, K. and J. Graehl. 1998. Machine Transliteration. Computational Linguistics, 24(4). PDF
Larkey, L., N. AbdulJaleel, and M. Connell. 2003. What's in a Name?: Proper Names in Arabic Cross Language Information Retrieval. CIIR Technical Report. PDF
Virga, P. and S. Khudanpur. 2003. Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL Workshop on Multi-lingual Named Entity Recognition. PDF (restricted access).