• Bootcamp (9)
    • 📱 236 - 992 - 3846

      📧 jxjwilliam@gmail.com

    • Version: 🚀 1.1.0
  • Kafka

    Bootcamp Bigdata 2, 2020-12-17


    Kafka

    Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java.

    1. A “high throughput distributed messaging system”
    2. Publish / subscribe
    3. Fast, scalable, durable, distributed
    4. Popular way to send data reliably over a cluster
    5. Kafka Brokers, clusters, streaming, topics
    6. producers - send
    7. consumers - receive
    8. broker - kafka server
    9. cluster - group of computers
    10. topic - a name for a kafka stream
    11. partition - part of a topic
    12. offset - unique id for a message within a partition
    13. consumer groups - a group of consumers acting as a single logical unit
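The storage terms above (topic, partition, offset, send, receive) fit together as append-only logs. A toy in-memory sketch in plain Scala — the class and method names here are hypothetical, for illustration only, not the real Kafka client API:

```scala
import scala.collection.mutable

// One record in a partition's log; the offset is its position in that log.
case class Record(offset: Long, value: String)

// Hypothetical toy model of a Kafka topic: a fixed set of partitions,
// each an append-only log.
class Topic(val name: String, numPartitions: Int) {
  private val partitions =
    Array.fill(numPartitions)(mutable.ArrayBuffer.empty[Record])

  // A producer "sends": the key picks a partition, the record is appended,
  // and the next offset in that partition is assigned.
  def send(key: String, value: String): (Int, Long) = {
    val p = math.abs(key.hashCode) % numPartitions
    val offset = partitions(p).length.toLong
    partitions(p) += Record(offset, value)
    (p, offset)
  }

  // A consumer "receives": it reads a partition from a given offset onward.
  def fetch(partition: Int, fromOffset: Long): Seq[Record] =
    partitions(partition).dropWhile(_.offset < fromOffset).toSeq
}

object Demo {
  def main(args: Array[String]): Unit = {
    val t = new Topic("clicks", numPartitions = 2)
    val (p, _) = t.send("user-1", "page-a")
    t.send("user-1", "page-b") // same key -> same partition, next offset
    println(t.fetch(p, 0L).map(_.value)) // both records, in order
  }
}
```

Note how the same key always lands in the same partition, which is what gives Kafka per-key ordering: offsets are only ordered within a single partition, not across the topic.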

    Kafka processes real-time streaming data efficiently and can integrate with Storm, HBase, and Spark. Deployed as a cluster across multiple servers, Kafka handles all of its publish/subscribe messaging through four APIs: the Producer API, Consumer API, Streams API, and Connector API. It can deliver streaming messages at scale, has built-in fault tolerance, and has replaced some traditional messaging systems such as JMS and AMQP.

    $ brew cask install homebrew/cask-versions/java8
    
    $ brew install kafka

    Zookeeper

    Zookeeper servers store metadata about brokers, topics, and partitions, while Kafka itself provides topics, each a named stream of records.
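Each broker is pointed at Zookeeper through its config file (with the Homebrew install above, that is `/usr/local/etc/kafka/server.properties`). A minimal sketch of the relevant keys, with illustrative default values:

```properties
# Zookeeper ensemble this broker registers with (cluster metadata lives here)
zookeeper.connect=localhost:2181

# Unique id for this broker within the cluster
broker.id=0

# Default partition count for auto-created topics
num.partitions=1
```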

    Spark

    Apache Spark is a unified analytics engine for large-scale data processing: batch, streaming, machine learning, and graph computation. It can access data in hundreds of data sources.

    What Apache Spark can do:

    1. Spark SQL and batch processing
    2. Stream processing with Spark Streaming and Structured Streaming
    3. Machine learning with MLlib
    4. Graph computations with GraphX
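The canonical Spark batch job is word count, written in the Spark shell roughly as `sc.textFile(path).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`. A sketch of the same pipeline shape in plain Scala collections, so it runs without a Spark installation (`groupBy` stands in for the shuffle that `reduceByKey` performs in real Spark):

```scala
object WordCount {
  // Same flatMap -> map -> reduce-per-key shape as the Spark job,
  // but over an in-memory Seq instead of an RDD.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))             // split lines into words
      .filter(_.nonEmpty)
      .groupBy(identity)                    // the "shuffle" step
      .map { case (w, ws) => (w, ws.size) } // reduce each key to a count

  def main(args: Array[String]): Unit =
    println(count(Seq("to be or not to be"))) // to->2, be->2, or->1, not->1
}
```

In real Spark the input would be partitioned across the cluster and the `groupBy` step would move data between executors; the collection version only shows the dataflow, not the distribution.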

    How to Install Scala and Apache Spark on MacOS

    $ brew install scala
    
    $ brew install apache-spark
    
    $ spark-shell

    Kafka with Spark Streaming

    Kafka + Spark = Reliable, scalable event ingestion and real-time stream processing

    Event stream processing architecture on Azure with Apache Kafka and Spark

    As of Spark 1.3, Spark Streaming can connect directly to Kafka.

    • Kafka is already replicated and durable, so no need to recreate that functionality with Spark receivers!
    • Prior to Spark 1.3, you’d connect to a Zookeeper host instead, and it was possible to lose data if that receiver went down. You need the spark-streaming-kafka package (it’s not built-in).

    sbt
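Since the Kafka integration is not built into Spark, it has to be declared in `build.sbt`. A minimal sketch for a Spark 2.x / Scala 2.11 project — the version numbers are illustrative and should be matched to your Spark installation:

```scala
// build.sbt -- versions are illustrative assumptions, not a pinned setup
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.4.8" % "provided",
  // the direct-connection Kafka integration (not built into Spark itself)
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.8"
)
```

`spark-streaming` is marked `provided` because the cluster already ships it; only the Kafka integration jar needs to travel with the application.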

    Commands

    // install the Java 1.8 JDK
    $ brew cask install homebrew/cask-versions/java8
    
    $ brew install kafka
    
    // To have launchd start kafka now and restart at login:
    $ brew services start kafka
    
    // Or, if you don't want/need a background service you can just run:
    
    $ zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties & kafka-server-start /usr/local/etc/kafka/server.properties
    
    // To have launchd start zookeeper now and restart at login:
    $ brew services start zookeeper
    
    // Or, if you don't want/need a background service you can just run:
    $ zkServer start