BERT Models in Document Classification

Thesis title: BERT Models in Document Classification
Author: Khateeb, Ahmad Arsalan
Thesis type: Diploma thesis
Supervisor: Kliegr, Tomáš
Reviewer: Beranová, Lucie
Thesis language: English
Abstract:
Document classification, or categorization, by algorithms is a well-known problem in information science that falls under natural language processing (NLP). After useful features are extracted from a document's text, machine learning algorithms can be applied to the task. Applications include sentiment analysis, spam filtering, and topic labeling. With recent advances in deep learning, neural network-based architectures have achieved state-of-the-art results on many NLP tasks. This thesis deals with document classification on a subset of the COVID-19 Open Research Dataset (CORD-19). The subset contains abstracts of scientific articles related to the coronavirus. The goal is to predict how well a paper is cited from the text of its abstract. Bidirectional Encoder Representations from Transformers (BERT) and a related model are used to perform this task. Their performance is compared with that of conventional machine learning techniques that are much simpler and faster to train, and the difference in performance is tested for statistical significance. The thesis has three main goals: a review of the literature on document classification using BERT and other machine learning techniques, a theoretical discussion of the methodology used for classification, and a discussion of the results obtained on this dataset.
Keywords: machine learning; transformers; BERT; COVID-19; deep learning; neural networks; document classification

Study information

Study programme / field: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's programme
Degree awarded: Ing.
Degree-granting institution: Vysoká škola ekonomická v Praze (Prague University of Economics and Business)
Faculty: Faculty of Informatics and Statistics
Department: Department of Information and Knowledge Engineering

Submission and defense information

Date of assignment: 18 October 2021
Date of submission: 30 June 2022
Date of defense: 25 August 2022
Identifier in InSIS: https://insis.vse.cz/zp/78427/podrobnosti
