    Bootcamp, Bigdata, 2020-12-17


    HDFS on K8S

    1. Kubernetes intro

    • cluster manager
    • runs programs in Linux containers
    • containers are virtual, independent, and a natural fit for micro-services
    • a unique virtual IP
    • an entire range of ports

    Pod - a unit of scheduling and isolation

    • runs a user program in a primary container
    • holds isolation layers like a virtual IP in an infra container
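As a rough illustration of the pod description above, the following Python sketch builds a minimal Pod manifest as a plain dictionary and prints it as JSON, which `kubectl` accepts alongside YAML. The pod name `demo-app` and the image `example.com/demo-app:latest` are placeholders, not values from the talk. Only the primary container is declared; the infra container that holds the virtual IP is created by Kubernetes on the node and never appears in the manifest.

```python
import json


def make_pod_manifest(name: str, image: str) -> dict:
    """Return a minimal Pod manifest with a single primary container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # The primary container runs the user program.
            # The infra container is not declared here; Kubernetes
            # creates it on the node when the pod is scheduled.
            "containers": [{"name": name, "image": image}],
        },
    }


# Placeholder name and image, chosen only for illustration.
manifest = make_pod_manifest("demo-app", "example.com/demo-app:latest")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` would be one way to submit it to a cluster.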

    2. Big Data on Kubernetes

    HDFS on Kubernetes

    3. Demo

    1. Label cluster nodes
    2. Stand up HDFS
    3. Launch a Spark job
    4. Check Spark job output
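The four demo steps can be sketched as command lines. The Python snippet below only assembles and prints the commands, it does not run them. Everything specific is an assumption: the node label `hdfs-role=datanode`, the Helm release name and chart path, the API server URL, the image, the jar path, and the driver pod name are placeholders. The `spark-submit` options follow the upstream Apache Spark "Running on Kubernetes" conventions, which may differ from the fork used in the original demo, and `helm install` is written in Helm 3 form.

```python
import shlex


def demo_commands(node, master_url, image, jar, driver_pod):
    """Build the four demo commands as argument lists (nothing is executed)."""
    # 1. Label a cluster node (hypothetical label key and value).
    label = ["kubectl", "label", "nodes", node, "hdfs-role=datanode"]
    # 2. Stand up HDFS from a chart (hypothetical release name and chart path).
    hdfs = ["helm", "install", "hdfs", "./hdfs-chart"]
    # 3. Launch a Spark job in cluster mode against the Kubernetes API server.
    spark = [
        "spark-submit",
        "--master", f"k8s://{master_url}",
        "--deploy-mode", "cluster",
        "--name", "spark-pi",
        "--class", "org.apache.spark.examples.SparkPi",
        "--conf", f"spark.kubernetes.container.image={image}",
        jar,
    ]
    # 4. Check the job output by reading the driver pod's log.
    logs = ["kubectl", "logs", driver_pod]
    return [label, hdfs, spark, logs]


# All argument values below are placeholders.
commands = demo_commands(
    node="node-1",
    master_url="https://k8s.example.com:6443",
    image="example.com/spark:latest",
    jar="local:///opt/spark/examples/jars/spark-examples.jar",
    driver_pod="spark-pi-driver",
)
for cmd in commands:
    print(shlex.join(cmd))
```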

    4. HDFS data locality

    HDFS data locality
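The notes give only the heading here, so the following is an assumption about what is meant: node-level locality, where a task is placed on an executor running on the same cluster node as a DataNode holding the block. The function and the node and executor names are hypothetical and are not Spark or HDFS APIs.

```python
def pick_executor(block_hosts, executor_nodes):
    """Pick an executor for a block, preferring one on a node that stores it.

    block_hosts: set of node names whose DataNodes hold a replica of the block.
    executor_nodes: dict mapping executor name to the node it runs on.
    Returns (executor_name, is_local).
    """
    for executor, node in sorted(executor_nodes.items()):
        if node in block_hosts:
            return executor, True  # node-local read
    # No executor shares a node with a replica: fall back to a remote read.
    return sorted(executor_nodes)[0], False


# Hypothetical cluster: the block has replicas on node-a and node-c.
block_hosts = {"node-a", "node-c"}
executor_nodes = {"exec-1": "node-b", "exec-2": "node-c"}
choice = pick_executor(block_hosts, executor_nodes)
print(choice)
```

The comparison is made on node names rather than pod IPs, since an executor pod's virtual IP is not the address of the node it runs on.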

    5. Reference

    • github.com/apache-spark-on-k8s
    • AWS EC2 cluster
    • Yarn vs. Kubernetes
    • Spark Yarn
    • S3, Elastic MapReduce
    • ElastiCache, Lambda
    • Cloud Native
    • PaaS - Platform as a Service

    Modern Big Data Pipelines over Kubernetes

    Cloud

    • Microsoft Azure (sky blue)
    • Amazon Web Services (AWS)
    • Google Cloud Platform

    Data Services

    • Unstructured: S3
    • Structured: DynamoDB, Cassandra
    • Stream: Kafka, Kinesis

    Serverless

    Modern Big Data Pipelines
