This is the project page for National Security Agency contract NBCHC040176, 09/30/2004-09/29/2006 under the auspices of REFLEX (Research on English and Foreign Language Exploitation).
The PIs on this project are:
Research Assistants assigned to this project are:
The purpose of this project is to fill the gap by providing resources and tools that will allow one to rapidly build named entity detectors for a collection of 50 languages in which we have expertise, nearly all with speaker populations numbering in the millions. This includes nearly all languages that fall into the category of "Less Commonly Taught Languages" (LCTLs). We will focus on the recognition of named entities falling into the categories of PERSON, ORGANIZATION and LOCATION.
Why do we believe that it is feasible to provide resources for so many languages? Research in the last few years has shown that machine learning approaches can learn to recognize and classify named entities reliably. Recognizing named entities requires:
The key to our technical approach is the observation that semi-supervised learning methods can be used for both stages. Given a lexicon of entities in a given language and language-specific features for what constitutes an entity, we will develop methods to bootstrap a named entity detector for new languages. For example, as demonstrated in (Collins and Singer, 1999), the second stage of this process can be solved reliably with a small number of initial examples.
We propose a two-year project. The first part of the project, accomplished in Year 1, will involve collecting plausible initial rule sets for the 50 languages listed below. These will include the widely spoken languages that have already received significant attention for named entity recognition (NER), which we will include as a litmus test for our methods, as well as many other languages that have received no attention. For each language, we will collect (from dictionaries, grammars or online resources, and with native speaker expertise) the following kinds of resources:
In a third part of the project, to be addressed in Years 1 and 2, we will also concern ourselves with the problem of identifying transliteration equivalents between arbitrary languages and English. In languages that use the Roman script this will generally not be an issue, but for languages that use other scripts we would like some way to determine that a particular named entity might correspond to a well-known entity name in English. We propose to make use of previous work on automatic transliteration (e.g., Knight and Graehl, 1997), coupled with a document-level model that compares the distribution of names in a given non-English document with the distribution of names in similar documents in English. More details will be given in the Tasks section below.
Finally, in Year 2, we will evaluate our work. We cannot evaluate on 50 languages, but we can take a sampling of languages for which we develop resources in Year 1, and demonstrate the performance of our methods on these. This will require acquiring (unannotated) training corpora and (annotated) testing corpora. We expect to be able to develop these corpora from online sources. Since annotation is required for the testing portion, we will limit ourselves to languages in which we have local expertise. We propose the following ten languages for evaluation: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua. This list includes both widespread languages, such as Chinese, and LCTLs. For languages like Chinese, we can use corpora that are already used for NER evaluation, as a way of comparing our methods with those of others. This will in turn give us a baseline for comparison with performance on LCTLs, so that we will have some sense, from these ten languages, of the range of difficulty of NER across languages.
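The idea of coupling a transliteration score with a document-level comparison of name distributions can be illustrated with a minimal sketch. This is not the project's method: the `string_similarity` function is a crude stand-in for a trained transliteration model, the linear interpolation and the `prior_weight` parameter are assumptions made for illustration, and the name counts and the romanized form "mosku" are invented toy data.

```python
from collections import Counter
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Crude stand-in for a transliteration model score, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_english_equivalents(romanized_name, english_name_counts, prior_weight=0.5):
    """Rank English names as possible equivalents of one romanized name.

    Each score interpolates string similarity with the name's relative
    frequency in comparable English documents (the document-level prior).
    """
    total = sum(english_name_counts.values())
    scored = []
    for name, count in english_name_counts.items():
        similarity = string_similarity(romanized_name, name)
        prior = count / total if total else 0.0
        score = (1.0 - prior_weight) * similarity + prior_weight * prior
        scored.append((name, score))
    # Highest score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: name counts from English documents on a similar topic.
english_counts = Counter({"Moscow": 12, "Putin": 8, "Mosul": 3, "Morocco": 1})

# "mosku" is a made-up romanization of a name found in a non-English document.
for name, score in rank_english_equivalents("mosku", english_counts):
    print(f"{name}\t{score:.3f}")
```

On this toy data, string similarity alone (`prior_weight=0.0`) prefers "Mosul", while the frequency of "Moscow" in the comparable English documents moves it to the top of the ranking. A real system would replace the similarity function with a phoneme- or grapheme-based transliteration model and estimate the weight from data.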
Arbabi, Mansur, Scott M. Fischthal, Vincent C. Cheng, and Elizabeth Bart. 1994. Algorithms for Arabic name transliteration. IBM Journal of Research and Development, 38(2):183-193.
AbdulJaleel, Nasreen, and Leah S. Larkey. 2002. English to Arabic Transliteration for Information Retrieval: A Statistical Approach. CIIR Technical Report.
AbdulJaleel, Nasreen, and Leah S. Larkey. 2003. Statistical Transliteration for English-Arabic Cross Language Information Retrieval. In CIKM '03.
Al-Onaizan, Y., and K. Knight. 2002. Machine Transliteration of Names in Arabic Text. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages.
Gao, Wei, Kam-Fai Wong, and Wai Lam. 2004. Phoneme-based transliteration of foreign names for OOV problem. In First International Joint Conference on Natural Language Processing, pages 374-381, Sanya, Hainan, China. Asia Federation for Natural Language Processing.
Gao, Wei. 2004. Phoneme-based Statistical Transliteration of Foreign Names for OOV Problem. Master's thesis, Chinese University of Hong Kong.
Kawtrakul, Asanee, Amarin Deemagarn, Chalatip Thumkanon, Navapat Khantonthong, and Paul McFetridge. 1998. Backward Transliteration for Thai Document Retrieval. In Proceedings of the 1998 IEEE Asia-Pacific Conference on Circuits and Systems, pages 563-566.
Knight, K., and J. Graehl. 1998. Machine Transliteration. Computational Linguistics, 24(4).
Larkey, L., N. AbdulJaleel, and M. Connell. 2003. What's in a Name?: Proper Names in Arabic Cross Language Information Retrieval. CIIR Technical Report.
Virga, P., and S. Khudanpur. 2003. Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL Workshop on Multi-lingual Named Entity Recognition.