Bootcamp (9)

Ge Help

📱 236 - 992 - 3846

📧 jxjwilliam@gmail.com

Version

Version: ‍🚀 1.1.0

1 Bigdata

BootcampBigdata2020-12-17

5 V

volume
velocity
variety

data type

structured data: csv, tsv, DB
semi-structured data - 5-10%: logs, xml, json
unstructured data: email, social media, sounds, vedio,images, word

Speed of data

batch
micro-batch
real-time

Speed of data = from real time to batched processing

    File Storage: HDFS, ZFS

    No-SQL:
    - `Key-Value Store`: Redis, Aerospike
    - `In-memory`: Ignite
    - `Distributed DB`: Cassandra, HBase

    Tranditional DB: MySQL, Postgres

    Raw File Formats
    
    Typical access Pattern:

    - Use all the fields to validata, enhance, join data
    - Read through whole data set

    Formats:
    - Plain Text (unstructured)
    - Structured Text Data (Rows - CSV, TSV, JSON)
    - Binary files (images, videos)
    - Avro (Row oriented)
   
    Processed File Formats

    Typical access Pattern:
    - Use limited fields to aggregate data or run other analytical queries
    - Read filtered subset

    Formats:

    - Parquet (Columnar oriented)
    - ORC

Storage & Formats of data

Fow Format - OOTP, query faster
Columnar format - OLAP, historical/archive data

Data compression

Ideal Big Data solution - main technical characteristics

Be scalable
Fault Tolerant
Ensure highly availability
Ensure data is widely accessible, but secure
Support analytics, data science, and content applications
Support workflow automation
Integrate with legacy applications
Be self-healing (自愈)

Functions

Data Collection
Data Storage
Data Exploration
Data Governance
Data Product

Blue Print

ETL
Streaming
Batch
Ingestion
Exploratory Environment: development env using product data without impace product pipeline.

More

Cloudera and Hortonworks are merged

1 Bigdata

5 V

data type

Speed of data

Storage & Formats of data

Data compression

Ideal Big Data solution - main technical characteristics

Functions

Blue Print

More

Github Radar

2 Hadoop