6 Spark

Bootcamp, Bigdata, 2020-12-17

Spark

Apache Spark is an open-source cluster-computing framework designed for speed and ease of use.

Keeps working data in memory where possible
Up to 100x faster than MapReduce for some in-memory workloads
Runs on Hadoop, Mesos, standalone, or in the cloud
Support for many programming languages

1. Spark vs. Hadoop MapReduce

	Hadoop MapReduce	Apache Spark
Language support	Java, C/C++, Ruby, Python, …	Scala, Java, Python, R, SQL
Written in	Java	Scala
Latency	Disk-oriented	Memory-oriented
Category	Data processing engine	Data analytics engine
Data processing	Batch	Batch, streaming
Fault tolerance	Replication	RDD lineage

2. Spark Architecture and Components

Spark Core - The heart of Apache Spark is the Spark Core. It provides distributed task dispatching, scheduling, basic input and output operations, and the RDD abstraction with the APIs to manipulate it. It works with its own scheduler to plan tasks and with a cluster manager to send those tasks to machines for execution. The cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and more recently Kubernetes) manage the cluster resources that the application runs on, not the underlying data that we want to analyze.
Spark SQL – This component replaces the older Shark (SQL on Spark) project and integrates more closely with Spark Core. It allows querying data through SQL and HiveQL and supports many data sources, including Hive tables, Parquet, and JSON. Spark SQL also allows developers to intermix SQL queries with code that manipulates RDDs in Python, Java, and Scala. It also provides fast SQL connectivity to BI tools like Tableau or QlikView.
Spark Streaming – Based on micro-batching, this component enables processing of real-time streaming data. It uses DStreams, which are series of RDDs, to process real-time data. The Spark Streaming API is very similar to the Spark Core RDD API, making it easy for developers to reuse and adapt batch code for interactive or real-time applications.
MLlib – provides a library of machine learning algorithms including classification, regression, clustering, and collaborative filtering, as well as model evaluation and data import.
GraphFrames – Provides DataFrame-based graphs. It aims to provide both the functionality of GraphX (which is now deprecated) and extended functionality that takes advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
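
The micro-batching idea behind Spark Streaming can be sketched without Spark at all. The following is a plain-Python toy model, not the Spark Streaming API: `micro_batches`, `word_count`, the sample events, and the one-second interval are all made up for illustration. It shows a stream being cut into a series of small batches (the role a DStream's RDDs play), with ordinary batch code reused on each one.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload); hypothetical shape


def micro_batches(events: Iterable[Event], batch_interval: float) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive fixed-length intervals.

    Each yielded list stands in for one small batch of the stream.
    Intervals that contain no events yield an empty list.
    """
    batch: List[str] = []
    window_end = None
    for ts, payload in events:
        if window_end is None:
            # Align the first window to a multiple of the batch interval.
            window_end = (ts // batch_interval + 1) * batch_interval
        while ts >= window_end:
            yield batch
            batch = []
            window_end += batch_interval
        batch.append(payload)
    if window_end is not None:
        yield batch  # flush the last, possibly partial, batch


def word_count(batch: List[str]) -> Dict[str, int]:
    """Ordinary batch logic, reused unchanged on every micro-batch."""
    counts: Dict[str, int] = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


events = [(0.2, "a b"), (0.7, "a"), (1.1, "b b"), (3.5, "c")]
results = [word_count(b) for b in micro_batches(events, batch_interval=1.0)]
print(results)
```

The point of the sketch is that `word_count` knows nothing about streaming; only the batching loop does, which is roughly why batch code is easy to adapt.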

Spark Installation

Java SE Development Kit 8u161 - JDK v1.8.0
Gradle, Maven
Scala, Swift, Kotlin
artifacts

Scala vs Kotlin:

Scala and Kotlin are in quite a tug of war. Scala has the edge over Kotlin in some ways, but Kotlin is just as formidable in others. The main difference between the two languages is that Kotlin is more like a better version of Java, while Scala is an entirely different kind of Java, so to speak.

RDD

RDDs (Resilient Distributed Datasets) are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.
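
To make "fault-tolerant" and "persist" more concrete, here is a single-machine toy model in plain Python. It is not the Spark API: `ToyRDD`, `lose_partition`, and the `computed` counter are hypothetical names invented for this sketch. It assumes the usual explanation of RDD fault tolerance, namely that each dataset remembers how it was derived from its parent, so a lost partition can be recomputed instead of restored from a replica.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Single-machine toy model of an RDD. Hypothetical class, not the Spark API."""

    def __init__(self, source: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[list], list]] = None) -> None:
        self._source = source   # partitions of the root dataset
        self._parent = parent   # lineage: the dataset this one derives from
        self._fn = fn           # lineage: how to derive one partition
        self._cache: Optional[dict] = None  # used only after persist()
        self.computed = 0       # number of partition computations run here

    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def persist(self) -> "ToyRDD":
        if self._cache is None:
            self._cache = {}
        return self

    def lose_partition(self, i: int) -> None:
        """Simulate losing a cached partition (for example, a failed worker)."""
        if self._cache is not None:
            self._cache.pop(i, None)

    def compute(self, i: int) -> list:
        if self._cache is not None and i in self._cache:
            return self._cache[i]
        self.computed += 1
        if self._parent is None:
            result = list(self._source[i])
        else:
            result = self._fn(self._parent.compute(i))  # replay the lineage
        if self._cache is not None:
            self._cache[i] = result
        return result

    def collect(self) -> list:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
squares = base.map(lambda x: x * x).persist()
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()     # computes and caches both partitions of squares
squares.lose_partition(0)   # drop one cached partition
second = evens.collect()    # only partition 0 of squares is recomputed
```

Nothing is computed until `collect` is called, and after the simulated loss only the missing partition is rebuilt; the other one is still served from the cache.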

Spark Actions

Spark ETL (ECL - Cleansing)

When data comes from different sources, the solution is to use a data warehouse that stores information from those sources in a uniform structure, loaded through ETL.

In a production environment, you will rarely work on a local filesystem; you are far more likely to work on distributed file systems such as HDFS and Amazon S3.

Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Spark lets you read data from HDFS in much the same way as you would read from a typical filesystem; the only difference is that the path points to the NameNode and the HDFS port.
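
As a small illustration of "pointing towards the NameNode and the HDFS port", the sketch below only builds the path string. The helper `hdfs_uri`, the host name, and the file path are hypothetical. The default port 8020 is an assumption (a commonly used NameNode RPC port); the real value comes from the cluster's `fs.defaultFS` setting.

```python
def hdfs_uri(namenode_host: str, path: str, port: int = 8020) -> str:
    """Build an hdfs:// URI from a NameNode host, port, and absolute path.

    Hypothetical helper for illustration; port 8020 is an assumed default.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute, e.g. /data/file.txt")
    return f"hdfs://{namenode_host}:{port}{path}"


# The same file addressed locally and on HDFS: only the scheme and authority differ.
local_uri = "file:///data/logs/events.txt"
remote_uri = hdfs_uri("namenode.example.com", "/data/logs/events.txt")
print(remote_uri)

# With PySpark installed and a session available, either string could be
# passed to a reader in the same way, for example: spark.read.text(remote_uri)
```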

S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. The core principles of S3 include scalability, high availability, low latency, and low pricing. S3 is fast when your cluster runs inside Amazon EC2, but performance can be a nightmare if you are accessing large amounts of data over the public Internet.

NoSQL

The most popular NoSQL databases include:
* Cassandra
* HBase
* MongoDB
* Solr
* Couchbase

Couchbase works with Spark, Kafka, Hadoop, Elasticsearch, Solr, JDBC.

Spark Core Part 2

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
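
The difference between a typed collection of domain objects and its untyped, named-column view can be hinted at in plain Python. This is only an analogy, not the Spark Dataset or DataFrame API; the `Person` class and the sample data are made up for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Person:  # hypothetical domain object
    name: str
    age: int


people: List[Person] = [Person("Ann", 34), Person("Bob", 27), Person("Cy", 41)]

# Typed, Dataset-like operation: fields are attributes of a known class,
# so a type checker can catch a misspelled field before the code runs.
adults_typed = [p for p in people if p.age > 30]

# Untyped, DataFrame-like view: every record is a generic row keyed by
# column name, and a misspelled column only fails at run time.
rows: List[Dict[str, object]] = [asdict(p) for p in people]
adult_names = [r["name"] for r in rows if r["age"] > 30]

print(adults_typed)
print(adult_names)
```

Both views hold the same data; what changes is whether the structure is known to the language (typed objects) or only to the data (named columns).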

RDD, DataFrame, Dataset

Spark on Kubernetes

Build Management at a Glance

(1)

SBT (Simple Build Tool): Scala DSL
Gradle: Groovy DSL
Maven: XML; builds JVM languages such as Scala

(2)

yaml, yml

(3) Running in Standalone Mode

Spark Cluster (the Spark Master listens on :7077)
Spark Workers (listed in the Master web UI, :8080 by default)
spark-submit to submit an application
Spark Driver -> Spark UI (this web UI is available only while the Driver is running)
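
Submitting to a standalone master can be sketched as building a `spark-submit` command line. The helper `spark_submit_command`, the host name, and the application file are hypothetical. `--master` and `--name` are existing `spark-submit` options, and port 7077 is taken from the notes above. The sketch only prints the command; it does not run it.

```python
import shlex
from typing import List, Optional


def spark_submit_command(master_host: str, app: str, master_port: int = 7077,
                         name: Optional[str] = None,
                         app_args: Optional[List[str]] = None) -> List[str]:
    """Build the argument list for spark-submit against a standalone master.

    Hypothetical helper for illustration only.
    """
    cmd = ["spark-submit", "--master", f"spark://{master_host}:{master_port}"]
    if name:
        cmd += ["--name", name]
    cmd.append(app)            # the application file comes after the options
    cmd += app_args or []      # everything after it is passed to the application
    return cmd


cmd = spark_submit_command("master.example.com", "wordcount.py",
                           name="wordcount", app_args=["input.txt"])
command_line = shlex.join(cmd)
print(command_line)
```

Keeping the command as a list makes it easy to hand to `subprocess.run(cmd)` later, without worrying about shell quoting.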

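Vocabulary

discretize - to split something continuous into discrete pieces
shuffling - reshuffling, as with a deck of cards
coarse-grained - working at a coarse level of granularity
coalesce - to merge
terminology - the set of terms used in a field
resilient - elastic; able to recover
experimental - on trial; not yet final
monolithic - built as one single piece
artifact - a produced item, such as a build output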