Using large language models for information extraction with applications in knowledge graphs

Thesis title:	Využití velkých jazykových modelů pro extrakci informací s aplikacemi ve znalostních grafech
Author:	Adam, Daniel
Thesis type:	Bachelor thesis
Supervisor:	Kliegr, Tomáš
Opponents:	Zeman, Václav
Thesis language:	Czech
Abstract:	The bachelor's thesis focuses on the use of large language models for information extraction with applications in knowledge graphs. The theoretical part surveys the uses and definitions of information extraction. It then surveys knowledge graphs, with examples of how language models are used. One section is devoted to data sources and their trustworthiness. The practical part deals with a validation script for verifying RDF statements from Wikidata. The thesis establishes the current state of language models with respect to information extraction and knowledge graphs. It describes the differences between less capable and more capable language models, which may show up, for example, in the speed of response generation or the quality of reasoning. Finally, a Python script was implemented and tested on 3 Wikidata subjects using the ChatGPT-3 and ChatGPT-4 language models. A preliminary manual evaluation found that ChatGPT-4 answers better and more accurately than ChatGPT-3, but it also suggested that the two models could be combined to obtain faster and more accurate results.
Keywords:	Large language models; RDF statements; ChatGPT

Thesis title:	Using large language models for information extraction with applications in knowledge graphs
Author:	Adam, Daniel
Thesis type:	Bachelor thesis
Supervisor:	Kliegr, Tomáš
Opponents:	Zeman, Václav
Thesis language:	Czech
Abstract:	The bachelor’s thesis focuses on using large language models for information extraction with applications in knowledge graphs. The theoretical part reviews the definition of information extraction and its types, then surveys knowledge graphs and the use of large language models. One subsection covers data sources and reliable data websites and introduces a Wikipedia ranking list. The practical part focuses on a validation script in Python for verifying RDF statements from Wikidata. The thesis covers the current state of using large language models for information extraction and knowledge graph engineering, and it shows the differences between less and more capable language models. A Python script was implemented and tested on 3 Wikidata subjects. A manual evaluation again showed a gap between the language models, but it also suggested that different models could be combined to optimize the process and deliver better and faster results.
Keywords:	Large language models; RDF statements; ChatGPT

Information about study

Study programme:	Applied Informatics
Type of study programme:	Bachelor's study programme
Assigned degree:	Bc.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Information and Knowledge Engineering

Information on submission and defense

Date of assignment:	22. 9. 2023
Date of submission:	5. 5. 2024
Date of defense:	14. 6. 2024
Identifier in the InSIS system:	https://insis.vse.cz/zp/85544/podrobnosti

Files for download

Main text
85544_adad05.pdf, 2.9 MB

Public annex
28482_adad05.zip, 480.1 kB

Opponent's review
82405_qzemv01.pdf, 129.1 kB

Supervisor's review
85544_klit01.pdf, 104.1 kB