Background annotation of entities in Linked Data vocabularies


Thesis title:	Background annotation of entities in Linked Data vocabularies
Author:	Serra, Simone
Thesis type:	Diploma thesis
Supervisor:	Svátek, Vojtěch
Opponents:	Zamazal, Ondřej
Thesis language:	English
Abstract:	One of the key features of Linked Data is the use of vocabularies, which allow datasets to share a common language for describing similar concepts and relationships and for resolving ambiguities between them. The development of vocabularies is often driven by a consensus process among dataset implementers, in which the criterion of interoperability is considered sufficient. This can lead to misrepresentation of real-world entities by Linked Data vocabulary entities. Such drawbacks can be fixed by using a formal methodology for modelling Linked Data vocabulary entities and identifying ontological distinctions. One proven example is the OntoClean methodology for curing taxonomies. This work presents a software tool that implements the PURO approach to modelling ontological distinctions. PURO models vocabularies as Ontological Foreground Models (OFM) and the structure of ontological distinctions as Ontological Background Models (OBM), which are constructed using meta-properties attached to vocabulary entities in a process known as vocabulary annotation. The software tool, named the Background Annotation plugin, is written in Java and integrated into the Protégé ontology editor. It enables a user to annotate vocabulary entities graphically through an annotation workflow that implements, among other things, persistence and retrieval of annotations. Two kinds of workflow are supported, generic and dataset-specific, in order to differentiate the usage of a vocabulary, in terms of a PURO OBM, with respect to a given Linked Data dataset. The workflow is enhanced by statistical indicators retrieved through the Sindice service for a sample of chosen datasets, such as the number of entities present in a dataset and the relative frequency of vocabulary entities in that dataset. A further enhancement is provided by dataset summaries, which offer an overview of the most common entity-property paths found in a dataset.
Foreseen uses of the Background Annotation plugin include: 1) checking mapping agreement between different datasets, as produced by the R2R framework, and 2) annotating dependent resources in Concise Bounded Descriptions of entities, which are used in sampling data from Linked Data datasets for data mining purposes.
Keywords:	ontological distinctions; vocabulary mappings; Protégé ontology editor; Linked Data vocabularies

Thesis title (Czech):	Background annotation entit v Linked Data slovníků

Information about study

Study programme:	Applied Informatics/Cognitive Informatics
Type of study programme:	Master's degree programme
Assigned degree:	Ing.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Information Technologies

Information on submission and defense

Date of assignment:	27. 9. 2012
Date of submission:	30. 5. 2013
Date of defense:	5. 6. 2013
Identifier in the InSIS system:	https://insis.vse.cz/zp/39039/podrobnosti

Files for download

Main text
39039_xsers01.pdf, 2.1 MB

Opponent's review
30805_svabo.pdf, 51.2 kB

Supervisor's review
39039_svatek.pdf, 181.8 kB