• Bootcamp (9)
    • 📱 236 - 992 - 3846

      📧 jxjwilliam@gmail.com

    • Version: 🚀 1.1.0
  • 201 Kafka 4

    Bootcamp Bigdata 2020-12-17


    Kafka

    Apache Kafka is a distributed streaming platform

    Publish/Subscribe pattern
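    The decoupling behind publish/subscribe can be sketched with a toy in-memory broker (illustration only; the class and topic names are made up, and real Kafka persists records to disk and lets consumers poll, rather than pushing via callbacks):

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory publish/subscribe broker (toy illustration)."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Producers never address consumers directly; the broker fans out.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)
broker.publish("orders", {"id": 1, "item": "book"})
print(received)  # [{'id': 1, 'item': 'book'}]
```

    The producer only knows the topic name, never the consumers; that decoupling is the core of the pattern.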

    What is Kafka

    • Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
    • Kafka is often used in place of traditional message brokers like JMS and AMQP because of its higher throughput, reliability and replication
    • Kafka may work in combination with Storm, Spark, Samza, Flink, etc. for real-time analysis and rendering of streaming data
    • Whatever the industry or use case, Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop

    What does Kafka do?

    Apache Kafka supports a wide range of use cases where high throughput, reliable delivery, and horizontal scalability are important. Apache Spark and Apache Cassandra work very well in combination with Kafka.

    Typical use cases include:

    • Messaging
    • Stream processing: pipelines that transfer and aggregate data; the lightweight Kafka Streams library can be used in place of Apache Storm or Apache Samza
    • Metrics collection & monitoring
    • Website activity tracking: pageviews, searches, and other user activity captured to topics
    • Event sourcing (CDC)
    • Log aggregation
    • Commit log (log replication)
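    The event-sourcing use case above can be illustrated with a minimal sketch (the event shapes are hypothetical; in Kafka the events would live in a topic and be replayed by a consumer):

```python
# Event sourcing: current state is never stored directly; it is
# derived by replaying the full, ordered stream of change events.
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 50},
]

def replay(events):
    balance = 0
    for e in events:
        balance += e["amount"] if e["type"] == "deposit" else -e["amount"]
    return balance

print(replay(events))  # 120
```

    Because the log is append-only, state at any point in time can be reconstructed by replaying a prefix of the stream.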


    Qualities

    • Scalability
    • Durability
    • Reliability
    • Performance

    High-level overview


    Kafka architecture


    Core APIs

    1. The Producer API allows an application to publish a stream of records to one or more Kafka topics.
    2. The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
    3. The Connect API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
    4. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, transforming the input streams to output streams.
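    The Producer and Consumer API roles can be sketched with an in-memory stand-in (this is not the real kafka-python or Java client API; real consumers track offsets per partition and can commit them back to Kafka):

```python
class Topic:
    """In-memory stand-in for a single-partition Kafka topic."""
    def __init__(self):
        self.log = []                 # append-only record log

    def produce(self, record):
        """Producer API role: append a record, return its offset."""
        self.log.append(record)
        return len(self.log) - 1

class Consumer:
    """Consumer API role: each consumer tracks its own read position."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0               # next offset to read

    def poll(self):
        records = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return records

t = Topic()
t.produce("a")
t.produce("b")
c = Consumer(t)
print(c.poll())   # ['a', 'b']
t.produce("c")
print(c.poll())   # ['c']
```

    Note that consuming does not delete records: the log is retained, and each consumer advances its own offset independently.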


    Topics

    Partitions

    • A topic consists of one or more partitions
    • Each partition is an ordered, immutable sequence of messages that is continually appended to, forming a structured commit log
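    Keyed records are routed to a partition by hashing the key, which is what gives Kafka per-key ordering. A sketch of the idea (Kafka's default partitioner uses murmur2; md5 is substituted here purely for illustration, and the helper name is made up):

```python
import hashlib

def partition_for(key, num_partitions):
    """Map a record key to a partition index (illustrative, not Kafka's hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition, so all records for
# that key land in one ordered, append-only sequence.
keys = ["user-1", "user-2", "user-1", "user-3"]
print([partition_for(k, 6) for k in keys])
```

    Ordering is guaranteed only within a partition, not across the whole topic; choosing the key therefore chooses what is ordered.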

    Kafka nouns

    • topic, producer, consumer, broker, cluster
    • ZooKeeper, Storm library, pub/sub
    • Kafka is comparable to traditional messaging systems such as ActiveMQ
    • topics, partitions, replicas, replication, in-sync replicas
    • leader, follower
    • load balance
    • consumer group
    • fault tolerance
    • log compaction, log cleaner
    • delivery semantics
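    One of the nouns above, log compaction, can be sketched as "keep only the newest record per key" (a simplification of the real log cleaner, which works segment by segment in the background):

```python
def compact(log):
    """Keep only the latest record per key; survivors stay in offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)          # later offsets overwrite earlier
    survivors = sorted(latest.items(), key=lambda kv: kv[1][0])
    return [(key, value) for key, (offset, value) in survivors]

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2")]
print(compact(log))  # [('user2', 'v1'), ('user1', 'v2')]
```

    A compacted topic thus retains at least the latest value for every key, which is what makes it usable as a changelog or snapshot store.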

    Topic Creation

    There are three ways a topic can be created:

    • automatically, when auto.create.topics.enable is true and a producer or consumer first references a topic that does not exist
    • explicitly, with the kafka-topics command-line tool
    • programmatically, through the AdminClient API

    Data serialization formats

    • Kafka does not care about the data format of a message payload
    • It is up to the developer to handle serialization/deserialization
    • Common choices in practice: Avro, JSON
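    Since Kafka treats payloads as opaque bytes, the developer supplies the (de)serializers. A minimal JSON choice might look like this (the function names are arbitrary; client libraries such as kafka-python typically accept callables like these as value serializer/deserializer hooks):

```python
import json

def serialize(obj):
    """Kafka payloads are opaque bytes; here we choose JSON + UTF-8."""
    return json.dumps(obj).encode("utf-8")

def deserialize(data):
    return json.loads(data.decode("utf-8"))

payload = serialize({"user": "alice", "action": "click"})
assert isinstance(payload, bytes)            # the broker only ever sees bytes
print(deserialize(payload))  # {'user': 'alice', 'action': 'click'}
```

    Avro is often preferred over JSON in production because its schema allows compact encoding and controlled schema evolution.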

    Kafka connect

    Kafka connect

    • Connectors – the high-level abstraction that coordinates data streaming by managing tasks
    • Tasks – the implementation of how data is copied to or from Kafka
    • Workers – the running processes that execute connectors and tasks; they run in one of two modes, standalone or distributed
    • Converters – the code used to translate data between Connect and the system sending or receiving data
    • Transforms – simple logic to alter each message produced by or sent to a connector
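    For example, a source connector is configured with a small JSON document submitted to the Connect REST API. A sketch for Confluent's JDBC source connector (the connector name, connection URL, and column name here are hypothetical; consult the connector documentation for the exact properties):

```json
{
  "name": "example-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "tasks.max": "1",
    "connection.url": "jdbc:postgresql://localhost:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "jdbc-"
  }
}
```

    With this configuration, new rows (detected via the incrementing id column) would be streamed into topics named after each table, prefixed with "jdbc-".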

    Standard Confluent connectors

    System Tools

    • Kafka Manager
    • Consumer Offset Checker
    • Dump Log Segment
    • Export Zookeeper Offsets
    • Get Offset Shell
    • Import Zookeeper Offsets
    • JMX Tool
    • Kafka Migration Tool
    • Mirror Maker
    • Replay Log Producer
    • Simple Consumer Shell
    • State Change Log Merger
    • Update Offsets In Zookeeper
    • Verify Consumer Rebalance

    Monitoring & Configuration

    Use of standard monitoring tools is recommended

    Collect log files into a central place

    • Logstash/Kibana and friends
    • Helps with troubleshooting, debugging, etc. โ€“ notably if you can correlate logging data with numeric metrics

    Q/A

    • ISR: In-Sync Replicas – the set of replicas that are fully caught up with the partition leader

    Ecosystem