BERT models in document classification
Thesis title: | BERT models in document classification |
---|---|
Author: | Khateeb, Ahmad Arsalan |
Thesis type: | Diploma thesis |
Supervisor: | Kliegr, Tomáš |
Opponents: | Beranová, Lucie |
Thesis language: | English |
Abstract: | Algorithmic document classification (or categorization) is a well-known problem in information science that falls within natural language processing (NLP). Once useful features have been extracted from a document's text, machine learning algorithms can be applied to the task, with applications that include sentiment analysis, spam filtering, and topic labeling. With recent advances in deep learning, neural network-based architectures have achieved state-of-the-art results on many NLP tasks. This thesis addresses document classification on a subset of the COVID-19 Open Research Dataset (CORD-19). The subset contains abstracts of scientific articles related to the coronavirus, and the goal is to predict how well a paper is cited from the text of its abstract. Bidirectional Encoder Representations from Transformers (BERT) and a related model are used for this task. Their performance is compared with that of conventional machine learning techniques, which are much simpler and faster to train, and the difference is tested for statistical significance. The thesis has three main goals: to review the literature on document classification using BERT and other machine learning techniques, to discuss the theory behind the classification methodology, and to discuss the results obtained on this dataset. |
Keywords: | machine learning; transformers; BERT; COVID-19; deep learning; neural networks; document classification |
Information about study
Study programme: | Economic Data Analysis/Data Analysis and Modeling |
---|---|
Type of study programme: | Master's study programme |
Assigned degree: | Ing. |
Institutions assigning academic degree: | Vysoká škola ekonomická v Praze |
Faculty: | Faculty of Informatics and Statistics |
Department: | Department of Information and Knowledge Engineering |
Information on submission and defense
Date of assignment: | 18. 10. 2021 |
---|---|
Date of submission: | 30. 6. 2022 |
Date of defense: | 25. 8. 2022 |
Identifier in the InSIS system: | https://insis.vse.cz/zp/78427/podrobnosti |