Processing Big Data on Spark and Databricks


Thesis title:	Processing Big Data on Spark and Databricks
Author:	Nguyen, Viet ha
Thesis type:	Diploma thesis
Supervisor:	Pavlíček, Antonín
Opponents:	Emelianov, Vladimir
Thesis language:	English
Abstract:	Technology is evolving at an unprecedented speed, and so is data. Many companies have built their business around data, as it has become one of the key factors in their success. The overwhelming amount of data gave rise to Big Data. Processing Big Data has proved to be a challenge for many companies and has given many developers a reason to create new technologies that allow seamless processing, storage and management not only of Big Data, but of data as a whole. Among the many competitors, Apache Spark has emerged as one of the leading technologies. Thanks to the contributions of the Apache Spark community and its founders, it has become a must-have for Big Data processing. To simplify processing even further, a Spark-based platform called Databricks was developed. This work compares processing on Spark and on Databricks side by side in the form of benchmarks, because choosing the right platform for Big Data is a key business decision.
Keywords:	Big Data; Apache Spark; Databricks

Thesis title:	Zpracování Big Dat na Sparku a Databricks
Author:	Nguyen, Viet ha
Thesis type:	Diploma thesis
Supervisor:	Pavlíček, Antonín
Opponents:	Emelianov, Vladimir
Thesis language:	English
Abstract:	Technology keeps evolving together with data. Many companies are built on data, as it is one of the key factors of success. With the arrival of enormous amounts of data, the term Big Data emerged. Working with Big Data was difficult for many companies and became the reason for developing new technologies that would enable seamless processing, storage and management of not only Big Data. Among the various technologies, Apache Spark has become one of the leading ones. Thanks to its community and founders, it is considered an integral part of Big Data processing. To simplify the work, a platform called Databricks was created. This thesis focuses on comparing the differences between Spark and Databricks. This is done in the form of benchmarks, as choosing the right platform is the foundation of success in Big Data.
Keywords:	Apache Spark; Databricks; Big Data

Information about study

Study programme:	Applied Informatics/Information Management
Type of study programme:	Master's study programme
Assigned degree:	Ing.
Institutions assigning academic degree:	Vysoká škola ekonomická v Praze
Faculty:	Faculty of Informatics and Statistics
Department:	Department of Systems Analysis

Information on submission and defense

Date of assignment:	27. 3. 2018
Date of submission:	24. 4. 2019
Date of defense:	4. 6. 2019
Identifier in the InSIS system:	https://insis.vse.cz/zp/65619/podrobnosti

Files for download

Main text
65619_xnguv19.pdf, 2.3 MB

Opponent's review
60847_Emelianov.pdf, 63 kB

Supervisor's review
65619_pavlant.pdf, 62.7 kB