• Bootcamp (9)
    • šŸ“± 236 - 992 - 3846

      šŸ“§ jxjwilliam@gmail.com

    • Version: šŸš€ 1.1.0
  • Spark

    Bootcamp Bigdata2 2021-02-16


    What is Spark

    • Apache Spark is an open-source cluster-computing framework for real-time processing
    • general-purpose cluster computing
    • in-memory computing system
    • used for fast data analytics
    • provides high-level APIs in Java, Scala, Python and R, plus an optimized engine that supports general execution graphs
    • Provides high-level tools like Spark SQL for structured data processing, MLlib for machine learning, and more
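    The classic first Spark job is a word count built from the functional operations these APIs expose. As a minimal sketch, the same pipeline is shown below on a plain Python list (no Spark installation assumed); the flatMap / map / reduceByKey steps are labeled in comments to show how each stage maps onto the RDD API:

```python
# Word count, Spark-style, on a local list.
lines = ["to be or not to be", "to live is to fly"]

# flatMap: split each line into words (one input -> many outputs, flattened)
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word (a dict stands in for the shuffle)
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n
```

In real Spark the same three steps run distributed across a cluster instead of in one process.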

    Why Apache Spark

    Spark Features

    Apache Spark is an open-source cluster-computing framework for real-time processing

    • Speed
    • Advanced Analytics
    • Real-time
    • Powerful Caching
    • Deployment

    Spark Architecture Overview

    • Driver Node: Spark Context
    • Cluster Manager
    • Workers: Executors
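    A rough local analogy of this driver/executor split, assuming nothing beyond the Python standard library: the "driver" partitions the data and hands tasks to a pool of workers, then aggregates the partial results, much as the Spark Context schedules tasks onto executors.

```python
# Minimal driver/executor analogy using a thread pool as the "workers".
from concurrent.futures import ThreadPoolExecutor

data = list(range(10))
partitions = [data[0:5], data[5:10]]        # driver partitions the data

def executor_task(partition):               # work shipped to an "executor"
    return sum(x * x for x in partition)

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(executor_task, partitions))

result = sum(partials)                      # driver aggregates the results
```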

    Spark Ecosystem

    1- Storage

    1. Local FS
    2. HDFS
    3. Amazon S3
    4. RDBMS
    5. NoSQL

    2- Management

    1. Yarn
    2. MESOS
    3. Spark

    3- Engine

    1. Spark Core Engine: the core engine for the entire Spark framework; provides utilities and architecture for the other components

    4- Library

    1. Spark SQL (SQL): used for structured data
       • Can run unmodified Hive queries on an existing Hadoop deployment
       • Can expose many datasets as tables
       • Can be integrated with Hive
    2. Spark Streaming (Streaming)
       • Enables analytical and interactive apps for live streaming data. Spark Streaming is used to stream *real-time* data from various sources like Twitter, the stock market and geographical systems, and perform powerful analytics to help businesses
       • A good alternative to Storm
    3. MLlib (Machine Learning): machine-learning libraries built on top of Spark
    4. GraphX (Graph Computation): graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
    5. SparkR: package for the R language to enable R users to leverage Spark power from the R shell
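    The kind of declarative query Spark SQL runs over structured data can be sketched locally: below, a list of dicts stands in for a DataFrame (the rows and the query are illustrative, not a Spark API), and the list comprehension plays the role of `SELECT name FROM rows WHERE age >= 18`.

```python
# Spark SQL-flavored sketch on plain Python rows.
rows = [
    {"name": "ada", "age": 36},
    {"name": "bob", "age": 17},
    {"name": "eve", "age": 29},
]

# SELECT name FROM rows WHERE age >= 18
adults = [r["name"] for r in rows if r["age"] >= 18]
```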

    5- Programming

    1. Scala
    2. Java
    3. Python
    4. R language: SparkR (R on Spark), a package for the R language to enable R users to leverage Spark power from the R shell. Machine learning, data analysis

    RDD: Resilient Distributed Data-Sets

    1. What is RDD? RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. They are Spark’s main programming abstraction

    Fundamental data structure

    1. Features of RDD
       • In-memory computation
       • Lazy evaluation
       • Fault tolerance
       • Immutability
       • Partitioning
       • Persistence
       • Coarse-grained operations
    2. Ways to create an RDD
       • Parallelize an existing collection in the driver program
       • Load an external dataset (e.g. a file in HDFS or S3)
    3. RDD Operations: transformations (lazy, return a new RDD) and actions (trigger computation and return a result)
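    Lazy evaluation is the feature that surprises newcomers most: transformations only record a recipe, and nothing executes until an action forces it. A local sketch of the same behavior using a Python generator (an analogy, not the Spark API):

```python
# Lazy evaluation sketch: like RDD transformations, a generator
# builds a recipe and does no work until an "action" forces it.
evaluated = []

def trace(x):
    evaluated.append(x)   # record when each element is actually computed
    return x * 2

data = range(5)
doubled = (trace(x) for x in data)   # "transformation": nothing runs yet
assert evaluated == []               # no computation has happened

result = list(doubled)               # "action": triggers evaluation
assert result == [0, 2, 4, 6, 8]
assert evaluated == [0, 1, 2, 3, 4]
```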

    Discretized Stream (DStream)

    The fundamental stream unit is the DStream, which is basically a series of RDDs used to process real-time data

    Processing of RDDs can happen in parallel on different worker nodes
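    The micro-batching idea behind DStreams can be sketched locally: chop an incoming sequence of records into small batches (stand-ins for the per-interval RDDs) and apply the same transformation to each batch, as Spark Streaming does every batch interval. The records and batch size below are illustrative.

```python
# Micro-batching sketch: a DStream modeled as a sequence of small batches.
stream = ["a b", "b c", "c d", "d e"]  # hypothetical incoming records
batch_size = 2
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

# Apply the same word-count transformation to every batch.
word_counts_per_batch = []
for batch in batches:
    counts = {}
    for record in batch:
        for w in record.split():
            counts[w] = counts.get(w, 0) + 1
    word_counts_per_batch.append(counts)
```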

    Common stateless transformations on DStreams

    • map
    • flatMap
    • filter
    • reduce
    • groupBy
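    These five transformations mirror the standard functional operations on collections, so their semantics can be shown on a plain Python list (a local analogy; the Spark versions run distributed, and `itertools.groupby` additionally requires sorted input, unlike Spark's `groupBy`):

```python
from functools import reduce
from itertools import groupby

batch = ["spark streams", "kafka streams"]

# map: exactly one output per input
upper = list(map(str.upper, batch))

# flatMap: each input may yield many outputs, flattened into one list
words = [w for s in batch for w in s.split()]

# filter: keep only elements matching a predicate
ks = list(filter(lambda w: w.startswith("k"), words))

# reduce: combine elements pairwise into a single value
total_chars = reduce(lambda a, b: a + b, (len(w) for w in words))

# groupBy: bucket elements by a key (here, word length)
by_len = {k: list(g) for k, g in groupby(sorted(words, key=len), key=len)}
```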

    Examples

    1. Yahoo!
    2. Twitter

    Commands

    $ spark-shell   # Web UI at localhost:4040

    collections

    Spark can be up to 100x faster than Hadoop MapReduce for certain applications
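    Much of that speed comes from keeping intermediate results in memory rather than re-reading them from disk, the "powerful caching" feature above. A local analogy of what `RDD.cache()` buys you, using simple memoization (the function and cache here are illustrative):

```python
# Caching sketch: persist an expensive intermediate result once,
# then serve every later reuse from memory instead of recomputing.
calls = []

def expensive(x):
    calls.append(x)       # record every real computation
    return x * x

cache = {}

def cached(x):
    if x not in cache:
        cache[x] = expensive(x)
    return cache[x]

results = [cached(3) for _ in range(5)]
```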

    Apache Spark with Apache Kafka

    Integrating Spark Streaming with Apache Kafka

    • "Big data" never stops!
    • Analyze data streams in real time, instead of in huge daily batch jobs
    • Analyze streams of web log data to react to user behavior
    • Analyze streams of real-time sensor data for "Internet of Things" applications
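    The web-log use case above can be sketched in miniature: scan one batch of incoming log lines, compute an error rate, and react when it crosses a threshold, the kind of per-batch decision a Spark Streaming + Kafka pipeline would make on live data. The log lines and threshold below are hypothetical.

```python
# Hypothetical log-stream batch: flag an error spike in web logs.
log_batch = [
    "GET /index 200",
    "GET /api 500",
    "POST /login 200",
    "GET /api 500",
]

errors = [line for line in log_batch if line.endswith("500")]
error_rate = len(errors) / len(log_batch)
alert = error_rate > 0.25   # hypothetical alerting threshold
```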

      Input DStreams -> DStream Transformations -> Output DStreams

      Streaming Context -> DStream -> Caching -> Accumulators, Broadcast Variables and Checkpoints