• Bootcamp (9)
  • 1 Bigdata

    Bootcamp · Bigdata · 2020-12-17


    The 5 Vs

    • volume
    • velocity
    • variety
    • veracity
    • value

    Data types

    • structured data: CSV, TSV, DB
    • semi-structured data (5-10%): logs, XML, JSON
    • unstructured data: email, social media, audio, video, images, Word documents
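
    A minimal sketch of how structured and semi-structured data differ in practice, using only Python's standard library. The sample records and field names are made up for illustration.

```python
import csv
import io
import json

# Structured data: every row has the same fixed columns.
csv_text = "id,name,city\n1,Ann,Vancouver\n2,Bob,Toronto\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: records describe themselves; fields may be missing or nested.
json_lines = [
    '{"id": 1, "name": "Ann", "tags": ["admin"]}',
    '{"id": 2, "name": "Bob", "address": {"city": "Toronto"}}',
]
records = [json.loads(line) for line in json_lines]

# Structured rows share one schema; semi-structured records may not.
csv_schemas = {tuple(row.keys()) for row in rows}
json_schemas = {tuple(sorted(rec.keys())) for rec in records}

print(len(csv_schemas))   # number of distinct CSV schemas
print(len(json_schemas))  # number of distinct JSON schemas
```

    Unstructured data (images, audio, free text) has no such field layout at all, so it usually needs a separate extraction step before it can be queried.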

    Speed of data

    • batch
    • micro-batch
    • real-time

    Speed of data: ranges from real-time to batch processing
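
    The three processing speeds can be contrasted with a small sketch. The event values and function names below are hypothetical; real systems would read from files, queues, or a streaming platform rather than a Python list.

```python
from typing import Iterable, Iterator, List

events = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical event values


def process_batch(data: List[int]) -> int:
    """Batch: wait for the full data set, then process it in one pass."""
    return sum(data)


def process_micro_batches(data: Iterable[int], size: int) -> Iterator[int]:
    """Micro-batch: group events into small chunks, one result per chunk."""
    chunk: List[int] = []
    for event in data:
        chunk.append(event)
        if len(chunk) == size:
            yield sum(chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        yield sum(chunk)


def process_stream(data: Iterable[int]) -> Iterator[int]:
    """Real-time: update the result as each event arrives."""
    total = 0
    for event in data:
        total += event
        yield total


batch_total = process_batch(events)
micro_totals = list(process_micro_batches(events, size=3))
running_totals = list(process_stream(events))
```

    All three approaches reach the same final total; what changes is how soon a partial result becomes available.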

        File Storage: HDFS, ZFS
    
        NoSQL:
        - `Key-Value Store`: Redis, Aerospike
        - `In-memory`: Ignite
        - `Distributed DB`: Cassandra, HBase
    
        Traditional DB: MySQL, Postgres
    
        Raw File Formats
        
        Typical access Pattern:
    
        - Use all the fields to validate, enhance, and join data
        - Read through whole data set
    
        Formats:
        - Plain Text (unstructured)
        - Structured Text Data (Rows - CSV, TSV, JSON)
        - Binary files (images, videos)
        - Avro (row-oriented)
       
        Processed File Formats
    
        Typical access Pattern:
        - Use limited fields to aggregate data or run other analytical queries
        - Read filtered subset
    
        Formats:
    
        - Parquet (column-oriented)
        - ORC (column-oriented)
    

    Storage & formats of data


    • Row format - OLTP, faster queries
    • Columnar format - OLAP, historical/archive data
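
    The difference between the two layouts can be illustrated in plain Python. This is only a sketch with made-up order records; real row and columnar formats (such as Avro and Parquet) add schemas, encoding, and compression on top of this idea.

```python
# Row-oriented: each record is stored together (good for whole-record reads and writes).
rows = [
    {"order_id": 1, "customer": "Ann", "amount": 20.0},
    {"order_id": 2, "customer": "Bob", "amount": 35.5},
    {"order_id": 3, "customer": "Ann", "amount": 12.5},
]

# Column-oriented: each field is stored together (good for scanning one field).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# OLTP-style access: fetch one complete record by key.
order_2 = next(row for row in rows if row["order_id"] == 2)

# OLAP-style access: aggregate a single column without touching the others.
total_amount = sum(columns["amount"])
```

    In the columnar layout the aggregate only reads the `amount` list, which is why analytical queries over a few fields tend to favor it.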

    Data compression
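
    As a rough illustration, the sketch below compresses the same made-up records in a row layout and a column layout with Python's built-in zlib module. Grouping similar values together often helps compression, but the exact sizes depend on the data and the codec, so no specific ratio is claimed here.

```python
import zlib

# Hypothetical data: 1,000 records with a low-cardinality "status" field.
statuses = ["ok", "ok", "ok", "error"] * 250
ids = [str(i) for i in range(1000)]

# Row layout interleaves the fields; column layout groups similar values.
row_layout = "\n".join(f"{i},{s}" for i, s in zip(ids, statuses)).encode()
col_layout = ("\n".join(ids) + "\n" + "\n".join(statuses)).encode()

row_compressed = zlib.compress(row_layout)
col_compressed = zlib.compress(col_layout)

# Compression is lossless: decompressing returns the original bytes.
restored = zlib.decompress(row_compressed)

print(len(row_layout), len(row_compressed))
print(len(col_layout), len(col_compressed))
```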


    Ideal Big Data solution - main technical characteristics

    • Be scalable
    • Be fault tolerant
    • Ensure high availability
    • Ensure data is widely accessible, but secure
    • Support analytics, data science, and content applications
    • Support workflow automation
    • Integrate with legacy applications
    • Be self-healing

    Functions

    • Data Collection
    • Data Storage
    • Data Exploration
    • Data Governance
    • Data Product


    Blueprint

    • ETL
    • Streaming
    • Batch
    • Ingestion
    • Exploratory Environment: a development environment that uses production data without impacting the production pipeline.
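
    To make the ETL item concrete, here is a minimal extract-transform-load sketch using only the standard library. The CSV content, the `payments` table, and the `transform` helper are all hypothetical; an in-memory SQLite database stands in for the target store.

```python
import csv
import io
import sqlite3

# Extract: read raw records (an in-memory CSV stands in for a source system).
raw = "customer,amount\nann,10\nbob,abc\nann,5\n"
extracted = list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Transform: clean names and drop rows whose amount is not a number."""
    for row in rows:
        try:
            yield row["customer"].title(), float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine bad rows


# Load: write the clean rows into the target store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", transform(extracted))
conn.commit()

totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM payments GROUP BY customer")
)
```

    Running the same steps on a schedule over accumulated files would be the batch variant; applying them per event as data arrives would be the streaming variant.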

    More

    • Cloudera and Hortonworks have merged