Spark
Bootcamp Big Data 2, 2021-02-16
What is Spark
- Apache Spark is an open-source cluster-computing framework for real-time processing
- general-purpose cluster computing
- in-memory computing system
- used for fast data analytics
- offers APIs in Java, Scala, Python and R, and provides an optimized engine that supports general execution graphs
- provides various high-level tools like Spark SQL for structured data processing, MLlib for machine learning, and more
Why Apache Spark
Spark Features
- Speed
- Advanced Analytics
- Real-time
- Powerful Caching
- Deployment
Spark Architecture Overview
- Driver Node: Spark Context
- Cluster Manager
- Workers: Executors
Spark Ecosystem
1- Storage
- Local FS
- HDFS
- Amazon S3
- RDBMS
- NoSQL
2- Management
- YARN
- Mesos
- Spark (standalone mode)
3- Engine
- Spark Core Engine: the core engine for the entire Spark framework. Provides utilities and architecture for the other components
4- Library
- Spark SQL (SQL)
- Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment
- Can expose many datasets as tables
- Can be integrated with Hive
- Spark Streaming (Streaming)
- Enables analytical and interactive apps for live streaming data
- Processes *real-time* data from various sources like Twitter, stock markets and geographical systems, and performs powerful analytics to help businesses
- A good alternative to Storm
- MLlib (Machine Learning) Machine learning libraries being built on top of Spark
- GraphX (Graph Computation) Graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
- SparkR Package for R language to enable R-users to leverage Spark power from R shell
5- Programming
- Scala
- Java
- Python
- R language: SparkR (R on Spark), a package enabling R users to leverage Spark from the R shell for machine learning and data analysis
RDD: Resilient Distributed Datasets
- What is RDD? RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. They are Spark's main programming abstraction and its fundamental data structure
- Features of RDD
- in-Memory Computation
- Lazy Evaluation
- Fault Tolerant
- Immutability
- Partitioning
- Persistence
- Coarse Grained Operations
- Ways to create RDD
  - Parallelizing an existing collection in the driver program (e.g. `sc.parallelize`)
  - Loading a dataset from external storage (e.g. `sc.textFile` on HDFS or the local FS)
- RDD Operations
  - Transformations: lazy operations that return a new RDD (e.g. `map`, `filter`)
  - Actions: operations that trigger computation and return a result (e.g. `collect`, `count`)
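The lazy-transformation / eager-action split above can be sketched in plain Python. This is an analogy, not PySpark: the `MiniRDD` class and its methods are hypothetical names invented here for illustration only.

```python
# Plain-Python analogy (NOT PySpark) of the RDD model.
# Transformations (map, filter) are lazy: they only record the work to do.
# Actions (collect, count) trigger the actual computation.

class MiniRDD:
    def __init__(self, data, pipeline=None):
        self._data = list(data)          # immutable source data
        self._pipeline = pipeline or []  # recorded transformations, not yet run

    # --- transformations: return a NEW MiniRDD, compute nothing yet ---
    def map(self, f):
        return MiniRDD(self._data, self._pipeline + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self._data, self._pipeline + [("filter", pred)])

    # --- actions: walk the recorded pipeline and materialize a result ---
    def collect(self):
        items = self._data
        for kind, fn in self._pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; collect() is the action that runs it.
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())              # 10: the original MiniRDD is unchanged
```

Note how each transformation returns a new object rather than mutating the old one, mirroring RDD immutability.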
Discretized Stream (DStream)
The fundamental stream unit is DStream which is basically a series of RDDs to process the real-time data
Processing of RDDs can happen in parallel on different worker nodes
Common stateless transformations on DStreams
- map
- flatMap
- filter
- reduce
- groupBy
Examples
- Yahoo!
Commands
$ spark-shell   # Web UI at http://localhost:4040
Spark can be up to 100x faster than Hadoop MapReduce for certain applications
Apache Spark with Apache Kafka
integrating Spark Streaming with Apache Kafka
- "Big data" never stops!
- Analyze data streams in real time, instead of in huge daily batch jobs
- Analyze streams of web log data to react to user behavior
- Analyze streams of real-time sensor data for "Internet of Things" applications
Input DStream -> DStream Transformations -> Output DStream
Streaming Context -> DStream -> Caching -> Accumulators, Broadcast Variables and Checkpoints
