Spark
Bootcamp Big Data 2, 2021-02-16
What is Spark
- Apache Spark is an open-source cluster-computing framework for real-time processing
- general-purpose cluster computing
- in-memory computing system
- used for fast data analytics
- offers APIs in Java, Scala, Python and R, and provides an optimized engine that supports general execution graphs
- provides various high-level tools like Spark SQL for structured data processing, MLlib for machine learning, and more
Why Apache Spark
Spark Features
- Speed
- Advanced Analytics
- Real-time
- Powerful Caching
- Deployment
Spark Architecture Overview
- Driver Node: Spark Context
- Cluster Manager
- Workers: Executors
Spark Ecosystem
1- Storage
- Local FS
- HDFS
- Amazon S3
- RDBMS
- NoSQL
2- Management
- YARN
- Mesos
- Spark (standalone mode)
3- Engine
- Spark Core Engine: the core engine for the entire Spark framework. Provides utilities and architecture for the other components
4- Library
- Spark SQL (SQL)
- Used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment
- Can expose many datasets as tables
- Can be integrated with Hive
- Spark Streaming (Streaming)
- Enables analytical and interactive apps for live streaming data
- Processes *real-time* data from various sources like Twitter, stock markets and geographical systems, and performs powerful analytics to help businesses
- A good alternative to Storm
- MLlib (Machine Learning) Machine learning libraries being built on top of Spark
- GraphX (Graph Computation) Graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
- SparkR Package for R language to enable R-users to leverage Spark power from R shell
5- Programming
- Scala
- Java
- Python
- R language: SparkR (R on Spark), a package enabling R users to leverage Spark from the R shell for machine learning and data analysis
RDD: Resilient Distributed Datasets
- What is RDD? RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. They are Spark's main programming abstraction and its fundamental data structure
- Features of RDD
- in-Memory Computation
- Lazy Evaluation
- Fault Tolerant
- Immutability
- Partitioning
- Persistence
- Coarse Grained Operations
- Ways to create RDD
  - Parallelizing an existing collection in the driver program (e.g. `sc.parallelize`)
  - Loading a dataset from external storage (e.g. `sc.textFile` on HDFS or the local FS)
- RDD Operations
  - Transformations: lazy operations that return a new RDD (e.g. `map`, `filter`)
  - Actions: operations that trigger computation and return a result (e.g. `collect`, `count`)
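The lazy-transformation / eager-action split above can be sketched in plain Python. This is an analogy, not PySpark: the `MiniRDD` class and its methods are hypothetical names invented here for illustration only.

```python
# Plain-Python analogy (NOT PySpark) of the RDD model.
# Transformations (map, filter) are lazy: they only record the work to do.
# Actions (collect, count) trigger the actual computation.

class MiniRDD:
    def __init__(self, data, pipeline=None):
        self._data = list(data)          # immutable source data
        self._pipeline = pipeline or []  # recorded transformations, not yet run

    # --- transformations: return a NEW MiniRDD, compute nothing yet ---
    def map(self, f):
        return MiniRDD(self._data, self._pipeline + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self._data, self._pipeline + [("filter", pred)])

    # --- actions: walk the recorded pipeline and materialize a result ---
    def collect(self):
        items = self._data
        for kind, fn in self._pipeline:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; collect() is the action that runs it.
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
print(rdd.count())              # 10: the original MiniRDD is unchanged
```

Note how each transformation returns a new object rather than mutating the old one, mirroring RDD immutability.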
Discretized Stream (DStream)
The fundamental stream unit is DStream which is basically a series of RDDs to process the real-time data
Processing of RDDs can happen in parallel on different worker nodes
Common stateless transformations on DStreams
- map
- flatMap
- filter
- reduce
- groupBy
Examples
- Yahoo!
Commands
$ spark-shell   # Web UI at http://localhost:4040
Spark can be up to 100x faster than Hadoop MapReduce for certain applications
Apache Spark with Apache Kafka
integrating Spark Streaming with Apache Kafka
- "Big data" never stops!
- Analyze data streams in real time, instead of in huge daily batch jobs
- Analyze streams of web log data to react to user behavior
- Analyze streams of real-time sensor data for "Internet of Things" applications
Input DStream -> DStream Transformations -> Output DStream
Streaming Context -> DStream -> Caching -> Accumulators, Broadcast Variables and Checkpoints
