Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Java
Open Source
Designed to run on commodity hardware
Master-slave design
Which node is responsible for high availability in Hadoop? The Standby NameNode
Secondary NameNode: a dedicated node in an HDFS cluster whose main function is to take checkpoints of the file system metadata held on the NameNode. It is not a backup NameNode; it only checkpoints the NameNode’s file system namespace.
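
A minimal sketch of what "taking a checkpoint" means, assuming a toy model rather than real Hadoop code: the namespace image is a plain dict, the edit log is a list of tuples, and the helper names `apply_edit` and `checkpoint` are hypothetical. It only illustrates the idea of merging logged changes into a fresh image so the log can be truncated.

```python
# Toy model of a Secondary NameNode checkpoint (not Hadoop code).
# The namespace image is a dict of path -> metadata, and the edit log
# is a list of operations recorded since the last checkpoint.

def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict in place."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image.

    Returns the new image and an empty log; the original image is untouched.
    """
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/data/a.csv": {"replication": 3}}
edit_log = [
    ("create", "/data/b.csv", 3),
    ("rename", "/data/a.csv", "/archive/a.csv"),
    ("delete", "/data/b.csv"),
]
new_image, remaining_log = checkpoint(fsimage, edit_log)
print(new_image)
```

The point of the sketch is that the checkpoint produces a compact image on behalf of the NameNode; it does not take over serving requests, which is why it is not a backup.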

5 V’s: volume, velocity, variety, veracity, value

`Pros`

Scalability
Flexibility
Fault-tolerance
Computing power

`Cons`

Security concerns
Overkill for small data
Potential stability issues
High costs for infrastructure

Ecosystem

Oozie: Workflow
HCatalog: Table and schema management
Pig: Scripting
Hive: SQL queries over data in HDFS or S3

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Mahout: Machine learning
Drill: Interactive analysis
Avro: Data serialization (schemas defined in JSON)
Thrift: Cross-language services
HBase: Column-oriented store
Sqoop: Data collection
Flume: Data collection
ZooKeeper: Coordination
Apache Ambari: Management and monitoring
MapReduce: Data processing (batch; Spark is an alternative)
YARN: Cluster resource management
HDFS: Hadoop Distributed File System
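
To make the "SQL-like interface" of Hive from the list above concrete, here is a small sketch of the kind of declarative query involved. It is an illustration only: SQLite from the Python standard library stands in for Hive, and the table `page_views` and its columns are hypothetical. HiveQL accepts basic aggregate statements of this shape, but nothing here talks to a Hadoop cluster.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "DE"), ("u2", "US"), ("u3", "US")],
)

# A declarative aggregate: say what you want, not how to compute it.
query = """
    SELECT country, COUNT(*) AS views
    FROM page_views
    GROUP BY country
    ORDER BY country
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```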

What do Pig, Hive and Spark have in common? They’re programming alternatives to MapReduce
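
For context on what these tools replace, here is a minimal word-count sketch of the MapReduce model in plain Python. It runs in a single process and uses no Hadoop APIs; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative assumptions that mirror the three stages of the model.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["Hive runs on Hadoop", "Pig runs on Hadoop"])
print(counts)
```

Writing the map and reduce steps by hand like this is the work that Pig scripts, Hive queries, and Spark transformations let you avoid; in Hive, for instance, the same count is a single `GROUP BY` query.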

Types of libraries

Universal: MapReduce, HDFS, Kudu, Tez, Solr
Pipelining: Pig, Cascading
SQL-like: Hive, Impala, Spark SQL
Graph Processing: Giraph, GraphX
Machine Learning: Mahout, Spark MLlib
Stream Processing: Spark Streaming, Storm

Hadoop 3.0 supports Docker containers (YARN can launch applications inside them).

Running Options

Open source: plain Apache Hadoop, rarely used on its own.
Pre-built enterprise-ready distribution
- Cloudera
- Hortonworks
- MapR
Cloud-managed cluster
- Amazon EMR
- Azure HDInsight
- AWS, Google and Azure have spot instances (can greatly reduce costs)

Most Hadoop vendors provide pre-configured images in popular cloud provider marketplaces (AWS, Azure, GCP).

HDP - Hortonworks data platform

- Governance and integration (data workflow): Sqoop, Flume, Kafka, NFS, WebHDFS
- Tools: Zeppelin, Ambari User Views, DSX
- Data access: listed below
- Security: Ranger, Knox, Atlas, HDFS Encryption
- Operations: Ambari, Cloudbreak, ZooKeeper, Scheduling, Oozie

Data Access: 
- Batch: MapReduce
- Script: Pig
- SQL: Hive, Druid
- NoSQL: HBase
- Stream: Storm
- Search: Solr
- In-Memory: Spark
- Others: BigSQL

YARN:
- Masters
- Slaves

Ambari:
- Install PuTTY to be able to SSH into the HDP Sandbox
- Open PuTTY
- Set the hostname to 127.0.0.1
- Set the port to 2222
- Connect with the credentials root/hadoop
- You will be prompted to change the root password
- Then run ambari-admin-password-reset
- Run ambari-agent restart
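
On systems with an OpenSSH client, the PuTTY steps above can be approximated from a terminal. The sketch below only builds and prints the commands; it does not connect to anything. The helper `ssh_command` is hypothetical, and the use of OpenSSH instead of PuTTY is an assumption; the host, port, user, and Ambari commands are taken from the steps above.

```python
import shlex

# Connection details from the steps above.
SANDBOX_HOST = "127.0.0.1"
SANDBOX_SSH_PORT = 2222


def ssh_command(user="root", host=SANDBOX_HOST, port=SANDBOX_SSH_PORT):
    """Build the argument list for an OpenSSH login to the sandbox."""
    return ["ssh", "-p", str(port), f"{user}@{host}"]


# Commands to type inside the session once logged in, in order.
follow_up = ["ambari-admin-password-reset", "ambari-agent restart"]

login = ssh_command()
print("Log in with:", shlex.join(login))
for command in follow_up:
    print("Then run:", command)
```

Keeping the login as an argument list makes it easy to pass to `subprocess.run(login)` later if you want the script to open the session itself.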

2 Hadoop