BERT Models in Document Classification
Title: | BERT models in document classification |
---|---|
Author: | Khateeb, Ahmad Arsalan |
Thesis type: | Diploma thesis |
Supervisor: | Kliegr, Tomáš |
Reviewers: | Beranová, Lucie |
Language: | English |
Abstract: | Document classification (or categorization) by algorithm is a well-known problem in information science and falls under natural language processing (NLP). Once useful features have been extracted from a document's text, machine learning algorithms can be applied to the task, with uses that include sentiment analysis, spam filtering, and topic labeling. With recent advances in deep learning, neural network-based architectures have achieved state-of-the-art results on many NLP tasks. This thesis deals with document classification on a subset of the COVID-19 Open Research Dataset (CORD-19). The subset contains abstracts of scientific articles related to the coronavirus, and the goal is to predict how well a paper is cited from the text of its abstract. Bidirectional Encoder Representations from Transformers (BERT) and a related model are used for this task. Their performance is compared with that of conventional machine learning techniques that are much simpler and faster to train, and the difference is tested for statistical significance. The thesis has three main goals: to review the literature on document classification using BERT and other machine learning techniques, to discuss the theory behind the classification methodology, and to discuss the results obtained on this dataset. |
Keywords: | machine learning; transformers; BERT; Covid-19; deep learning; neural networks; document classification |
Study information
Study programme / field: | Economic Data Analysis/Data Analysis and Modeling |
---|---|
Programme type: | Master's degree programme |
Degree awarded: | Ing. |
Degree-granting institution: | Vysoká škola ekonomická v Praze (Prague University of Economics and Business) |
Faculty: | Fakulta informatiky a statistiky (Faculty of Informatics and Statistics) |
Department: | Katedra informačního a znalostního inženýrství (Department of Information and Knowledge Engineering) |
Submission and defense information
Date of assignment: | 18 October 2021 |
---|---|
Date of submission: | 30 June 2022 |
Date of defense: | 25 August 2022 |
Identifier in InSIS: | https://insis.vse.cz/zp/78427/podrobnosti |