Design and Implementation of a Genomic Data Storing and Processing Solution

Original title (Czech):	Návrh a implementace řešení pro ukládání a zpracování genomických dat

Thesis title:	Design and Implementation of a Genomic Data Storing and Processing Solution
Author:	Holub, Ondřej
Thesis type:	Bachelor thesis
Supervisor:	Karkošková, Soňa
Opponents:	Potančok, Martin
Thesis language:	Czech
Abstract:	Advances in DNA sequencing research are spurring a steep reduction in the cost of sequencing a human genome and a corresponding increase in the volume and complexity of the resulting genomic data. This long-term trend is driving a rapid transformation of biomedicine, with the goal of using the generated data effectively in research and clinical care. The main objective of this thesis is to design a solution for transforming, annotating, and storing in a distributed manner data that describe DNA sequences and genetic variants, and to implement parameterized views for comparing stored samples in the context of expert annotations. The main objective is achieved by fulfilling three partial objectives. The first is to describe the technological limitations associated with annotating and analyzing genomic data, and the requirements that arise from them, in the context of the current state of genomics and related big data methods and technologies. The second is to design an integration of the Apache Spark framework with selected domain-specific software tools that satisfies the defined set of requirements for genomic data processing. The third is to demonstrate the pre-processing of input data files and to implement the domain logic of advanced parameterized data views.
Keywords:	big data; Apache Spark; bioinformatics; genomics; healthcare

Information about study

Study programme:	Applied Informatics/Applied Informatics
Type of study programme:	Bachelor's degree programme
Assigned degree:	Bc.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Information Technologies

Information on submission and defense

Date of assignment:	6. 9. 2018
Date of submission:	10. 12. 2018
Date of defense:	22. 1. 2019
Identifier in the InSIS system:	https://insis.vse.cz/zp/66567/podrobnosti

Files for download

Main text:	66567_holo00.pdf, 2.6 MB
Opponent's review:	59479_xpotm03.pdf, 64.7 kB
Supervisor's review:	66567_xkars05.pdf, 63 kB