Usage of Large Language Models for Rules Induction based on the Examples from Text

Original title (Czech):	Využití velkých jazykových modelů pro indukci pravidel na základě příkladů z textu

Thesis title:	Usage of Large Language Models for Rules Induction based on the Examples from Text
Author:	Jurášek, Eduard
Thesis type:	Bachelor thesis
Supervisor:	Kliegr, Tomáš
Opponents:	Máša, Petr
Thesis language:	Czech
Abstract:	This thesis examines whether large language models (LLMs) can be used for frequent pattern mining. The thesis assignment was to experimentally verify the ability of LLMs to induce general rules describing the common characteristics of a small number of given examples, and the thesis accordingly has three objectives. The main objective is to experiment with freely available LLMs. The other objectives are a literature review of methods for working with LLMs and an evaluation of the experiments, including a comparison with existing rule-mining approaches based on the Apriori and FP-Growth algorithms. Experiments were conducted with three LLMs (Copilot, Gemini, and ChatGPT-3.5) in combination with eight prompts that asked the model to list the frequent itemsets occurring in several versions of a small input dataset. In its default version, the dataset contained commonly known column names and values (names of animals and their characteristics). In subsequent versions, this information was progressively masked and replaced with meaningless identifiers. The best results were obtained with Copilot, which reached up to 100% accuracy on the dataset with meaningless identifiers but markedly lower accuracy on the original data. A similar trend, in which meaningless identifiers led to better results, was also observed for ChatGPT-3.5 (when the result is computed as the arithmetic mean of all eight experiments). It remains an open question whether this approach would also work on larger datasets and in other domains.
Keywords:	ChatGPT; Bing Chat; Copilot; CSV; Google Bard; Google Gemini; GUHA; itemsets; association rules; Apriori algorithm; FP-Growth algorithm; EasyMiner; support; confidence; Python
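
To make the terms in the abstract concrete, the following is a minimal Python sketch of frequent itemset mining with support and confidence, together with the masking step that replaces column names and values with meaningless identifiers. It is an illustration only and is not taken from the thesis: the toy animal rows, the function names, the `c1`/`v1` identifier scheme, and the 0.6 support threshold are all assumptions, and the search is a brute-force enumeration rather than an implementation of Apriori or FP-Growth.

```python
from itertools import combinations

# Hypothetical toy data (NOT the thesis dataset): one row per animal.
ROWS = [
    {"animal": "dog", "legs": "4", "covering": "fur", "flies": "no"},
    {"animal": "cat", "legs": "4", "covering": "fur", "flies": "no"},
    {"animal": "sparrow", "legs": "2", "covering": "feathers", "flies": "yes"},
    {"animal": "eagle", "legs": "2", "covering": "feathers", "flies": "yes"},
    {"animal": "penguin", "legs": "2", "covering": "feathers", "flies": "no"},
]


def to_transactions(rows):
    """Turn each row into a set of 'column=value' items."""
    return [frozenset(f"{col}={val}" for col, val in row.items()) for row in rows]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support.

    Brute-force enumeration, suitable for tiny inputs only.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                result[itemset] = support
                found = True
        if not found:
            # No frequent itemset of this size, so no larger one can be frequent.
            break
    return result


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(a <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_a if n_a else 0.0


def mask(rows):
    """Replace column names and values with meaningless identifiers (assumed scheme)."""
    col_ids, val_ids = {}, {}
    masked = []
    for row in rows:
        new_row = {}
        for col, val in row.items():
            c = col_ids.setdefault(col, f"c{len(col_ids) + 1}")
            v = val_ids.setdefault((col, val), f"v{len(val_ids) + 1}")
            new_row[c] = v
        masked.append(new_row)
    return masked


if __name__ == "__main__":
    tx = to_transactions(ROWS)
    for itemset, support in frequent_itemsets(tx, 0.6).items():
        print(sorted(itemset), support)
    print(confidence(tx, {"legs=2"}, {"covering=feathers"}))
```

Because masking only renames items, mining `to_transactions(mask(ROWS))` yields the same number of frequent itemsets with the same support values as the original rows. A deterministic algorithm is therefore unaffected by the renaming, which is what makes the accuracy differences between masked and unmasked inputs reported for the LLMs notable.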

Information about study

Study programme:	Applied Informatics
Type of study programme:	Bachelor's study programme
Assigned degree:	Bc.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Information and Knowledge Engineering

Information on submission and defense

Date of assignment:	7. 12. 2023
Date of submission:	5. 5. 2024
Date of defense:	14. 6. 2024
Identifier in the InSIS system:	https://insis.vse.cz/zp/86831/podrobnosti

Files for download

Main text
86831_jure01.pdf, 1.8 MB

Opponent's review
82333_xmasp06.pdf, 104.5 kB

Supervisor's review
86831_klit01.pdf, 111.4 kB