Unsupervised Entity Classification with Wikipedia and WordNet

English
Česky

Thesis title:	Unsupervised Entity Classification with Wikipedia and WordNet
Author:	Kliegr, Tomáš
Thesis type:	Dissertation thesis
Supervisor:	Rauch, Jan
Opponents:	Berka, Petr; Smrž, Pavel ; Žabokrtský, Zdeněk
Thesis language:	English
Abstract:	This dissertation addresses the problem of classification of entities in text represented by noun phrases. The goal of this thesis is to develop a method for automated classification of entities appearing in datasets consisting of short textual fragments. The emphasis is on unsupervised and semi-supervised methods that will allow for fine-grained character of the assigned classes and require no labeled instances for training. The set of target classes is either user-defined or determined automatically. Our initial attempt to address the entity classification problem is called Semantic Concept Mapping (SCM) algorithm. SCM maps the noun phrases representing the entities as well as the target classes to WordNet. Graph-based WordNet similarity measures are used to assign the closest class to the noun phrase. If a noun phrase does not match any WordNet concept, a Targeted Hypernym Discovery (THD) algorithm is executed. The THD algorithm extracts a hypernym from a Wikipedia article defining the noun phrase using lexico-syntactic patterns. This hypernym is then used to map the noun phrase to a WordNet synset, but it can also be perceived as the classification result by itself, resulting in an unsupervised classification system. SCM and THD algorithms were designed for English. While adaptation of these algorithms for other languages is conceivable, we decided to develop the Bag of Articles (BOA) algorithm, which is language agnostic as it is based on the statistical Rocchio classifier. Since this algorithm utilizes Wikipedia as a source of data for classification, it does not require any labeled training instances. WordNet is used in a novel way to compute term weights. It is also used as a positive term list and for lemmatization. A disambiguation algorithm utilizing global context is also proposed. We consider the BOA algorithm to be the main contribution of this dissertation. Experimental evaluation of the proposed algorithms is performed on the WordSim353 dataset, which is used for evaluation in the Word Similarity Computation (WSC) task, and on the Czech Traveler dataset, the latter being specifically designed for the purpose of our research. BOA performance on WordSim353 achieves Spearman correlation of 0.72 with human judgment, which is close to the 0.75 correlation for the ESA algorithm, to the author's knowledge the best performing algorithm for this gold-standard dataset, which does not require training data. The advantage of BOA over ESA is that it has smaller requirements on preprocessing of the Wikipedia data. While SCM underperforms on the WordSim353 dataset, it overtakes BOA on the Czech Traveler dataset, which was designed specifically for our entity classification problem. This discrepancy requires further investigation. In a standalone evaluation of THD on Czech Traveler dataset the algorithm returned a correct hypernym for 62% of entities.
Keywords:	named entity recognition; Hearst patterns; bag of words; natural language processing

Thesis title:	Klasifikace entit pomocí Wikipedie a WordNetu
Author:	Kliegr, Tomáš
Thesis type:	Disertační práce
Supervisor:	Rauch, Jan
Opponents:	Berka, Petr; Smrž, Pavel ; Žabokrtský, Zdeněk
Thesis language:	English
Abstract:	Dizertační práce se věnuje problému klasifikace entit reprezentovaných jmennými frázemi v textu. Cílem je vyvinout metodu pro automatizovanou klasifikaci těchto entit v datasetech skládajících se z krátkých textových fragmentů. Důraz je kladen na metody učení bez učitele, nebo kombinaci učení s učitelem a bez učitele (angl. semi-supervised learning), přičemž nebudou vyžadovány trénovací příklady. Třídy jsou buď automaticky stanoveny nebo zadány uživatelem. Náš první pokus pro řešení problému klasifikace entit je algoritmus Sémantického Mapování Konceptů (angl. Semantic Concept Mapping -- SCM). Tento algoritmus mapuje jmenné fráze i cílové třídy na koncepty thesauru WordNet. Grafové míry podobnosti pro WordNet jsou použity pro přiřazení nejbližší třídy k dané jmenné frázi. Pokud jmenná fráze není namapována na žádný koncept, potom je použit algoritmus Cíleného Objevování Hyperonym (angl. Targeted Hypernym Discovery -- THD). Tento algoritmus extrahuje s pomocí lexiko-syntaktických vzorů hyperonymum z článku na Wikipedii, který danou jmennou frázi definuje. Toto hyperonymum je použito k namapování jmenné fráze na koncept ve WordNetu. Hyperonymum může být samo o sobě také považováno za výsledek klasifikace. V takovém případě je dosaženo klasifikace bez učitele. Algoritmy SCM a THD byly navrženy pro angličtinu. I když je možné oba algoritmy přizpůsobit i pro jiné jazyky, byl v rámci dizertační práce vyvinut algoritmus Pytel článků (angl. Bag of Articles -- BOA), který je jazykově agnostický, protože je založen na statistickém Rocchio klasifikátoru. Díky zapojení Wikipedie jako zdroje informací pro klasifikaci nevyžaduje BOA trénovací data. WordNet je využit novým způsobem, a to pro výpočet vah slov, jako pozitivní seznam slov a pro lematizaci. Byl také navržen disambiguační algoritmus pracující s globálním kontextem. Algoritmus BOA považujeme za hlavní přínos dizertace. Experimentální hodnocení navržených algoritmů je provedeno na datasetu WordSim353 používaném pro hodnocení systémů pro výpočet podobnosti slov (angl. Word Similarity Computation -- WSC), a na datasetu Český cestovatel, který byl vytvořen speciálně pro účel našeho výzkumu. Na datasetu WordSim353 dosahuje BOA Spearmanova korelačního koeficientu 0.72 s lidským hodnocením. Tento výsledek je blízko hodnotě 0.75 dosažené algoritmem ESA, který je podle znalosti autora nejlepším algoritmem pro daný dataset nevyžadujícím trénovací data. Algoritmus BOA je ale výrazně méně náročný na předzpracování Wikipedie než ESA. Algoritmus SCM nedosahuje dobrých výsledků na datasetu WordSim353, ale naopak předčí BOA na datasetu Český cestovatel, který byl navržen speciálně pro úlohu klasifikace entit. Tato nesrovnalost vyžaduje další výzkum. V samostatném hodnocení THD na malém počtu pojmenovaných entit z datasetu Český cestovatel bylo správné hyperonymum nalezeno v 62 % případů. Další dosažené výsledky samostatného významu zahrnují novou funkci pro vážení slov založenou na WordNetu, kvalitativní a kvantitativní vyhodnocení možností využití Wikipedie jako zdroje textů pro objevování hyperonym s využitím lexiko-syntaktických vzorů a zevrubnou rešerši měr podobnosti nad WordNetem zahrnující též jejich výkonnostní porovnání na datasetech WordSim353 a Český cestovatel.
Keywords:	Hearst patterns; rozpoznání pojmenovaných entit; pytel slov; zpracování přirozeného jazyka

Information about study

Study programme:	Aplikovaná informatika/Informatika
Type of study programme:	Doktorský studijní program
Assigned degree:	Ph.D.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Information and Knowledge Engineering

Information on submission and defense

Date of assignment:	30. 9. 2007
Date of submission:	30. 9. 2012
Date of defense:	5. 11. 2012
Identifier in the InSIS system:	https://insis.vse.cz/zp/14953/podrobnosti

Files for download

Main text
14953_klit01.pdf, 2.1 MB Download

Opponent's review
28063_berka.pdf, 99.2 kB Download

Opponent's review
28064_Smrž.pdf, 109.4 kB Download

Opponent's review
28065_Žabokrtský.pdf, 58.7 kB Download

Supervisor's review
14953_rauch.pdf, 160.1 kB Download