
    Bootcamp | Bigdata | 2020-12-17


    Spark

    Apache Spark is an open-source cluster-computing framework designed for speed and ease of use.

    • Keeps working data in memory wherever possible
    • Up to 100x faster than MapReduce for in-memory workloads
    • Runs on Hadoop, Mesos, standalone, or in the cloud
    • Support for many programming languages

    1. Spark vs. Hadoop MapReduce

    Feature          | Hadoop MapReduce             | Apache Spark
    Language support | Java, C/C++, Ruby, Python, … | Scala, Java, Python, R, SQL
    Written in       | Java                         | Scala
    Latency          | disk oriented                | memory oriented
    Category         | data processing engine       | data analytics engine
    Data processing  | batch                        | batch, streaming
    Fault tolerance  | replication                  | RDD

    Spark vs Hadoop MapReduce

    2. Spark Architecture and Components

    Spark components

    • Spark Core - The heart of Apache Spark. It provides distributed task dispatching, scheduling, basic input and output operations, and the RDD abstraction with the APIs to manipulate it. Its scheduler plans the tasks, and a cluster manager provides the machines on which they execute. The supported cluster managers (Spark standalone, Apache Mesos, Hadoop YARN, and more recently Kubernetes) manage the cluster's resources, not the data being analyzed.
    • Spark SQL – A component that replaces the older Shark (SQL on Spark) project and integrates more closely with Spark Core. It allows querying data through SQL and HiveQL and supports many data sources, including Hive tables, Parquet and JSON. Spark SQL also lets developers mix SQL queries with RDD-based data manipulation code in Python, Java, and Scala, and it provides fast SQL connectivity to BI tools like Tableau or QlikView.
    • Spark Streaming – Based on micro-batching, this component enables processing of real-time streaming data. It uses DStreams, which are sequences of RDDs, one per batch interval. The Spark Streaming API is very similar to the Spark Core RDD API, which makes it easy for developers to reuse batch code in interactive or real-time applications.
    • MLlib – provides a library of machine learning algorithms including classification, regression, clustering, and collaborative filtering, as well as model evaluation and data import.
    • GraphFrames – Provides DataFrame-based graphs. It aims to provide both the functionality of GraphX (which is now deprecated) and extended functionality that takes advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization and highly expressive graph queries.
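The micro-batching idea described under Spark Streaming above can be illustrated without Spark at all. The following is a minimal sketch in plain Python, not the DStream API: the function name `micro_batches`, the `(timestamp, value)` event shape, and the one-second interval are all assumptions made for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Toy stand-in for how a stream is discretized; this is not Spark's API.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Each event lands in the batch whose time window contains it.
        batches[int(timestamp // batch_interval)].append(value)
    # Return the non-empty batches in time order, like a sequence of small datasets.
    return [batches[i] for i in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)

# The same batch-style function is applied to every micro-batch.
counts = [len(batch) for batch in batches]
print(batches)
print(counts)
```

Because each micro-batch is just a small dataset, code written for batch processing can be applied to it unchanged, which is the reuse benefit the Spark Streaming bullet describes.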

    Spark Architecture

    Job terms

    Spark Installation

    • Java SE Development Kit 8u161 - JDK v1.8.0
    • Gradle, Maven
    • Scala, Swift, Kotlin
    • artifacts

    Scala vs Kotlin:

    Scala and Kotlin compete closely. Scala has the edge over Kotlin in some ways, but Kotlin is just as formidable in others. The main difference is that Kotlin reads like an improved version of Java, while Scala is a distinctly different JVM language that blends object-oriented and functional programming.

    RDD

    RDDs (Resilient Distributed Datasets) are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

    RDD Features
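To make the fault-tolerance property concrete, here is a minimal sketch in plain Python of the idea behind it: a dataset records the transformations applied to it (its lineage) and can rebuild any single partition from the source data by replaying them. `ToyRDD` and its methods are hypothetical names for illustration only; the class imitates the shape of the RDD API but is not Spark code.

```python
class ToyRDD:
    """Toy model of an RDD: source partitions plus a recorded lineage."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # source data, split into partitions
        self.lineage = lineage        # transformations recorded, not yet applied

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.lineage + (("filter", pred),))

    def compute_partition(self, index):
        """Rebuild one partition from source data by replaying the lineage."""
        rows = self.partitions[index]
        for kind, fn in self.lineage:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def collect(self):
        # An action: forces computation of every partition.
        out = []
        for i in range(len(self.partitions)):
            out.extend(self.compute_partition(i))
        return out


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = evens_squared.collect()
# If the second partition were lost, only that partition needs recomputing.
recovered = evens_squared.compute_partition(1)
print(result, recovered)
```

The sketch leaves out persistence and partitioning control, which the definition above also mentions; it only shows why recorded lineage makes a lost partition recoverable without replicating the data.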

    Spark Stream

    Micro Batches

    Spark Actions

    Spark ETL (ECL - Cleansing)

    Spark ETL

    When data comes from different sources, the solution is to use a data warehouse that stores the information from all of them in a uniform structure, populated through ETL (Extract, Transform, Load).
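As a rough illustration of what "uniform structure" means here, the sketch below runs a tiny ETL pass in plain Python over two made-up sources with different schemas. The source data, the field names, and the list standing in for the warehouse are all assumptions; a real Spark job would read and write through its own data source connectors.

```python
import csv
import io
import json

# Two hypothetical sources describing the same kind of record differently.
CSV_SOURCE = "id,full_name,amount_usd\n1,Ada,10.50\n2,Grace,7.25\n"
JSON_SOURCE = '[{"userId": 3, "name": "Linus", "amountCents": 1999}]'


def extract():
    """Read raw rows from each source in its native shape."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Map both schemas onto one target schema: id, name, amount_cents."""
    unified = []
    for r in csv_rows:
        unified.append({
            "id": int(r["id"]),
            "name": r["full_name"],
            "amount_cents": round(float(r["amount_usd"]) * 100),
        })
    for r in json_rows:
        unified.append({
            "id": r["userId"],
            "name": r["name"],
            "amount_cents": r["amountCents"],
        })
    return unified


def load(rows, warehouse):
    """Append the unified rows to the warehouse (a plain list in this sketch)."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(*extract()), [])
for row in warehouse:
    print(row)
```

Cleansing steps such as dropping malformed rows or normalizing units would belong in `transform`, before anything reaches the warehouse.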

    In a production environment you will rarely work on a local filesystem; far more often the data lives on a distributed file system such as HDFS or in Amazon S3.

    Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. Spark lets you read data from HDFS in much the same way as from a typical filesystem; the only difference is that the path points to the NameNode host and the HDFS port.
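The difference between a local path and an HDFS path is mostly a matter of the URI. The helper below is a small sketch in plain Python; `hdfs_uri`, the host name, and the file path are hypothetical, and the default port 8020 is an assumption (a commonly used NameNode RPC port; the real value comes from the cluster's `fs.defaultFS` setting).

```python
def hdfs_uri(namenode_host, path, port=8020):
    """Build an hdfs:// URI that points at a NameNode host and port.

    The default port is an assumption; check fs.defaultFS on the cluster.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    return f"hdfs://{namenode_host}:{port}{path}"


# Hypothetical locations for the same file.
local_path = "file:///data/events.csv"
remote_path = hdfs_uri("namenode.example.com", "/data/events.csv")

print(local_path)
print(remote_path)
# With a SparkSession named `spark`, either string could be passed to a
# reader in the same way, for example spark.read.csv(remote_path).
```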

    S3 stands for Simple Storage Service, an online storage service provided by Amazon Web Services. The core principles of S3 include scalability, high availability, low latency, and low pricing. S3 is fast when your cluster runs inside Amazon EC2, but performance can degrade badly if you access large amounts of data over the public Internet.

    NoSQL

    The most popular NoSQL databases include:
    * Cassandra
    * HBase
    * MongoDB
    * Solr
    * Couchbase

    Couchbase works with Spark, Kafka, Hadoop, Elasticsearch, Solr, JDBC.

    Spark Core Part 2

    DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

    Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

    RDD, DataFrame, DataSet

    Spark API concepts

    Spark API comparison

    Spark Structured API

    Spark on Kubernetes


    Cluster managers comparison

    Build management at a Glance

    (1)

    • sbt (Simple Build Tool): build definitions written in a Scala DSL
    • Gradle: build scripts written in Groovy
    • Maven: XML build files; builds JVM languages, including Scala

    (2)

    • yaml, yml

    (3) Running in Standalone Mode

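    • Spark Master: accepts workers and applications on port 7077 by default (spark://<host>:7077)
    • Spark Workers: register with the Master and are listed in its web UI (port 8080 by default)
    • spark-submit: submits an application to the cluster
    • Spark Driver: serves the application's Spark UI, which is available only while the Driver is running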
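Submitting to a standalone cluster comes down to passing the Master's URL to `spark-submit` through its `--master` option. The sketch below assembles such a command in plain Python; `spark_submit_command`, the host name, and the application file are hypothetical, and the port defaults to the 7077 mentioned above.

```python
import shlex


def spark_submit_command(master_host, app_path, master_port=7077, app_args=()):
    """Assemble a spark-submit argument list for a standalone Master."""
    master_url = f"spark://{master_host}:{master_port}"
    return ["spark-submit", "--master", master_url, app_path, *app_args]


# Hypothetical host and application; nothing is executed here.
cmd = spark_submit_command(
    "master.example.com", "wordcount.py", app_args=["input.txt"]
)
command_line = shlex.join(cmd)
print(command_line)
```

The list form can be handed to `subprocess.run(cmd)` on a machine where Spark is installed; the joined string is only for display.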

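    Vocabulary

    • discretize - to split something continuous into discrete pieces
    • shuffling - reordering, as in shuffling cards
    • coarse-grained - operating on large units rather than fine details
    • coalesce - to merge
    • terminology - the technical terms of a field
    • resilient - elastic; able to recover
    • experimental - on trial; not yet stable
    • monolithic - built as a single piece
    • artifact - an output produced by a build, such as a packaged JAR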