• Bootcamp (9)
    • 📱 236 - 992 - 3846

      📧 jxjwilliam@gmail.com

    • Version: 🚀 1.1.0
  • Kafka

    Bootcamp Bigdata 2, 2020-12-17


    Kafka

    Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java.

    1. A “high throughput distributed messaging system”
    2. Publish / subscribe
    3. Fast, scalable, durable, distributed
    4. Popular way to send data reliably over a cluster
    5. Kafka Brokers, clusters, streaming, topics
    6. producers - send
    7. consumers - receive
    8. broker - kafka server
    9. cluster - group of computers
    10. topic - a name for a kafka stream
    11. partition - part of a topic
    12. offset - unique id for a message within a partition
    13. consumer groups - a group of consumers acting as a single logical unit
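The storage terms above (topic, partition, offset, send, receive) fit together as append-only logs. A toy in-memory sketch in plain Scala — the class and method names here are hypothetical, for illustration only, not the real Kafka client API:

```scala
import scala.collection.mutable

// One record in a partition's log; the offset is its position in that log.
case class Record(offset: Long, value: String)

// Hypothetical toy model of a Kafka topic: a fixed set of partitions,
// each an append-only log.
class Topic(val name: String, numPartitions: Int) {
  private val partitions =
    Array.fill(numPartitions)(mutable.ArrayBuffer.empty[Record])

  // A producer "sends": the key picks a partition, the record is appended,
  // and the next offset in that partition is assigned.
  def send(key: String, value: String): (Int, Long) = {
    val p = math.abs(key.hashCode) % numPartitions
    val offset = partitions(p).length.toLong
    partitions(p) += Record(offset, value)
    (p, offset)
  }

  // A consumer "receives": it reads a partition from a given offset onward.
  def fetch(partition: Int, fromOffset: Long): Seq[Record] =
    partitions(partition).dropWhile(_.offset < fromOffset).toSeq
}

object Demo {
  def main(args: Array[String]): Unit = {
    val t = new Topic("clicks", numPartitions = 2)
    val (p, _) = t.send("user-1", "page-a")
    t.send("user-1", "page-b") // same key -> same partition, next offset
    println(t.fetch(p, 0L).map(_.value)) // both records, in order
  }
}
```

Note how the same key always lands in the same partition, which is what gives Kafka per-key ordering: offsets are only ordered within a single partition, not across the topic.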

    Kafka processes real-time streaming data efficiently and can integrate with Storm, HBase, and Spark. Deployed as a cluster across multiple servers, Kafka handles all of its publish/subscribe messaging through four APIs: the Producer API, Consumer API, Streams API, and Connector API. It can deliver streaming messages at scale, has built-in fault tolerance, and has replaced some traditional messaging systems such as JMS and AMQP.

    $ brew cask install homebrew/cask-versions/java8
    
    $ brew install kafka

    Zookeeper

    Zookeeper servers store metadata about brokers, topics, and partitions, while Kafka itself provides topics, each a named stream of records.
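Each broker is pointed at Zookeeper through its config file (with the Homebrew install above, that is `/usr/local/etc/kafka/server.properties`). A minimal sketch of the relevant keys, with illustrative default values:

```properties
# Zookeeper ensemble this broker registers with (cluster metadata lives here)
zookeeper.connect=localhost:2181

# Unique id for this broker within the cluster
broker.id=0

# Default partition count for auto-created topics
num.partitions=1
```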

    Spark

    Apache Spark is a unified analytics engine for large-scale data processing: batch, streaming, machine learning, and graph computation. It can access data in hundreds of data sources.

    What Apache Spark can do:

    1. Spark SQL and batch processing
    2. Stream processing with Spark Streaming and Structured Streaming
    3. Machine learning with MLlib
    4. Graph computations with GraphX
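The canonical Spark batch job is word count, written in the Spark shell roughly as `sc.textFile(path).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)`. A sketch of the same pipeline shape in plain Scala collections, so it runs without a Spark installation (`groupBy` stands in for the shuffle that `reduceByKey` performs in real Spark):

```scala
object WordCount {
  // Same flatMap -> map -> reduce-per-key shape as the Spark job,
  // but over an in-memory Seq instead of an RDD.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))             // split lines into words
      .filter(_.nonEmpty)
      .groupBy(identity)                    // the "shuffle" step
      .map { case (w, ws) => (w, ws.size) } // reduce each key to a count

  def main(args: Array[String]): Unit =
    println(count(Seq("to be or not to be"))) // to->2, be->2, or->1, not->1
}
```

In real Spark the input would be partitioned across the cluster and the `groupBy` step would move data between executors; the collection version only shows the dataflow, not the distribution.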

    How to Install Scala and Apache Spark on MacOS

    $ brew install scala
    
    $ brew install apache-spark
    
    $ spark-shell

    Kafka with Spark Streaming

    Kafka + Spark = Reliable, scalable event ingestion and real-time stream processing

    Event stream processing architecture on Azure with Apache Kafka and Spark

    As of Spark 1.3, Spark Streaming can connect directly to Kafka.

    • Kafka is already replicated and durable, so no need to recreate that functionality with Spark receivers!
    • Prior to Spark 1.3, you’d connect to a Zookeeper host instead, and it was possible to lose data if that receiver went down. You need the spark-streaming-kafka package (it’s not built-in).

    sbt
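Since the Kafka integration is not built into Spark, it has to be declared in `build.sbt`. A minimal sketch for a Spark 2.x / Scala 2.11 project — the version numbers are illustrative and should be matched to your Spark installation:

```scala
// build.sbt -- versions are illustrative assumptions, not a pinned setup
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.4.8" % "provided",
  // the direct-connection Kafka integration (not built into Spark itself)
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.8"
)
```

`spark-streaming` is marked `provided` because the cluster already ships it; only the Kafka integration jar needs to travel with the application.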

    Commands

    // install the Java 1.8 JDK
    $ brew cask install homebrew/cask-versions/java8
    
    $ brew install kafka
    
    // To have launchd start kafka now and restart at login:
    $ brew services start kafka
    
    // Or, if you don't want/need a background service you can just run:
    
    $ zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties & kafka-server-start /usr/local/etc/kafka/server.properties
    
    // To have launchd start zookeeper now and restart at login:
    $ brew services start zookeeper
    
    // Or, if you don't want/need a background service you can just run:
    $ zkServer start