BERT models in document classification

Thesis title: BERT models in document classification
Author: Khateeb, Ahmad Arsalan
Thesis type: Diploma thesis
Supervisor: Kliegr, Tomáš
Opponents: Beranová, Lucie
Thesis language: English
Abstract:
Document classification, or categorization, with algorithms is a well-known problem in information science that falls under natural language processing (NLP). Once useful features have been extracted from the text of a document, machine learning algorithms can be applied to the task. Applications include sentiment analysis, spam filtering, and topic labeling. With recent advances in deep learning, neural network-based architectures have achieved state-of-the-art results on many NLP tasks. This thesis deals with document classification on a subset of the COVID-19 Open Research Dataset (CORD-19). The subset contains abstracts of scientific articles related to the coronavirus. The goal is to predict how well a paper is cited based on the text of its abstract. Bidirectional Encoder Representations from Transformers (BERT) and a related model are used to perform this task. Their performance is compared with that of conventional machine learning techniques, which are much simpler and faster to train, and the difference in performance is tested for statistical significance. The thesis has three main goals: a review of the literature on document classification using BERT and other machine learning techniques, a theoretical discussion of the methodology used for classification, and a discussion of the results obtained on the dataset.
Keywords: machine learning; transformers; BERT; COVID-19; deep learning; neural networks; document classification

Information about study

Study programme: Economic Data Analysis/Data Analysis and Modeling
Type of study programme: Master's study programme
Assigned degree: Ing.
Institutions assigning academic degree: Vysoká škola ekonomická v Praze
Faculty: Faculty of Informatics and Statistics
Department: Department of Information and Knowledge Engineering

Information on submission and defense

Date of assignment: 18. 10. 2021
Date of submission: 30. 6. 2022
Date of defense: 25. 8. 2022
Identifier in the InSIS system: https://insis.vse.cz/zp/78427/podrobnosti
