HDFS Kubernetes
Bootcamp Bigdata 2020-12-17
HDFS on K8S
1. Kubernetes intro
- cluster manager
- runs programs in Linux containers
- virtual, independent, micro-service
- a unique virtual IP
- an entire range of ports
Pod - a unit of scheduling and isolation
- runs a user program in a primary container
- holds isolation layers like a virtual IP in an infra container
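The Pod description above can be sketched as a minimal manifest. This is an illustration, not part of the original talk: the helper name make_pod_manifest, the pod name "hello" and the busybox image are assumptions. Only the primary container is declared; the infra container that holds the virtual IP is added by Kubernetes and never appears in the manifest.

```python
import json


def make_pod_manifest(name, image, command):
    """Build a minimal Pod manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # The primary container that runs the user program.
            "containers": [
                {"name": name, "image": image, "command": command}
            ],
            # Run once and stop; do not restart the container on exit.
            "restartPolicy": "Never",
        },
    }


# Example values are placeholders chosen for illustration.
manifest = make_pod_manifest("hello", "busybox", ["echo", "hello"])
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f, since kubectl accepts JSON as well as YAML.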
2. Big Data on Kubernetes

3. Demo
- Label cluster nodes
- Stand up HDFS
- Launch a Spark job
- Check Spark job output
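The four demo steps could look roughly like the commands built below. This is a sketch under stated assumptions: the node name, label key and value, manifest path, image, class and jar are hypothetical placeholders, and the spark-submit options follow upstream Spark's Kubernetes support, which may differ from the build used in the demo. The functions only build argument lists; nothing is executed against a cluster.

```python
import shlex


def label_node_cmd(node, key, value):
    """Step 1: label a cluster node so HDFS pods can be pinned to it."""
    return ["kubectl", "label", "nodes", node, f"{key}={value}"]


def apply_manifest_cmd(path):
    """Step 2: stand up HDFS from a manifest file (path is a placeholder)."""
    return ["kubectl", "apply", "-f", path]


def spark_submit_cmd(api_server, image, main_class, app_jar):
    """Step 3: launch a Spark job in cluster mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


def logs_cmd(pod):
    """Step 4: check the job output by reading the driver pod's log."""
    return ["kubectl", "logs", pod]


if __name__ == "__main__":
    # All values below are hypothetical and only show the command shapes.
    steps = [
        label_node_cmd("node-1", "hdfs-role", "namenode"),
        apply_manifest_cmd("hdfs.yaml"),
        spark_submit_cmd("https://example:6443", "example/spark:latest",
                         "org.example.Main", "local:///opt/app.jar"),
        logs_cmd("spark-driver-pod"),
    ]
    for argv in steps:
        print(shlex.join(argv))
```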
4. HDFS data locality
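A toy sketch of the idea, not Spark's actual scheduler: a task reads fastest when its executor runs on the same cluster node as a datanode holding the block. Because each pod has its own virtual IP, the sketch assumes locality is matched on the node hostname rather than on the pod IP. The function names and host names are hypothetical; NODE_LOCAL and ANY are borrowed from Spark's locality level names.

```python
def locality_level(executor_node, block_hosts):
    """Return a coarse locality level for one executor and one block."""
    return "NODE_LOCAL" if executor_node in block_hosts else "ANY"


def pick_executor(executors, block_hosts):
    """Pick an executor for a block, preferring a node-local one.

    executors:   dict mapping executor id -> hostname of the node it runs on
    block_hosts: set of node hostnames whose datanodes hold the block
    """
    # Iterate in sorted order so the choice is deterministic.
    for exec_id, node in sorted(executors.items()):
        if node in block_hosts:
            return exec_id
    # No node-local executor: fall back to any executor, or None if empty.
    return min(executors) if executors else None


if __name__ == "__main__":
    executors = {"exec-1": "node-a", "exec-2": "node-b"}
    block_hosts = {"node-b", "node-c"}
    chosen = pick_executor(executors, block_hosts)
    print(chosen, locality_level(executors[chosen], block_hosts))
```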

5. Reference
- github.com/apache-spark-on-k8s
- AWS EC2 cluster
- YARN vs. Kubernetes
- Spark on YARN
- S3, Elastic MapReduce
- ElastiCache, Lambda
- Cloud Native
- PaaS - Platform as a Service
Modern Big Data Pipelines over Kubernetes
Cloud
- Microsoft Azure ("azure" means sky blue)
- Amazon Web Services (AWS)
- Google Cloud Platform
Data Services
- Unstructured: S3
- Structured: DynamoDB, Cassandra
- Stream: Kafka, Kinesis
Serverless


