# 4.1 Spark, Hortonworks, HDFS, CAP

## Spark

### Apache Spark

* Motivation: Traditional MapReduce and classical parallel runtimes cannot run iterative algorithms efficiently
  * Hadoop: Repeated data access to HDFS on every iteration; no optimizations for data caching & data transfers
  * MPI: No natural support for fault tolerance; programming interface is complicated
* Apache Spark: Extends the MapReduce model to better support two common classes of analytics apps:
  * Iterative algorithms (ML, graphs)
  * Interactive data mining
* Why are current frameworks not working?
  * Most cluster programming models use acyclic data flow (from stable storage to stable storage)
  * Acyclic data flow is inefficient for apps that repeatedly reuse a working set of data
* Solution: **Resilient Distributed Datasets (RDDs)**
  * Advantages
    * Allow apps to keep working sets in memory for efficient reuse
    * Retains the attractive properties of MapReduce (fault tolerance, data locality, scalability)
    * Supports a wide range of applications
  * Properties
    * Immutable, partitioned collections of objects
    * Created through parallel transformations (map, filter, groupBy, join) on data in stable storage
    * Can be cached for efficient reuse
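
The properties above can be illustrated with a toy sketch in plain Python (this is not Spark or its API; `MiniRDD` and `parallelize` are made-up illustrative names): each dataset is immutable, built lazily from transformations on its parent, and optionally kept in memory after the first evaluation.

```python
# Toy model of the RDD idea: immutable, lazily evaluated, cacheable.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute   # closure that (re)builds this dataset
        self._keep = False        # set by cache()
        self._cache = None        # in-memory copy once materialized

    def map(self, f):
        # Transformations return a NEW dataset; parents are never mutated.
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._cache is not None:      # efficient reuse of a working set
            return self._cache
        data = self._compute()
        if self._keep:
            self._cache = data
        return data

def parallelize(data):
    # Stands in for creating an RDD from data in stable storage.
    return MiniRDD(lambda: list(data))

nums = parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0).cache()
squares = evens.map(lambda x: x * x)
print(squares.collect())  # → [0, 4, 16, 36, 64]; 'evens' is now cached
```

Any later transformation built on `evens` reads the cached copy instead of recomputing from the source, which is exactly the win for iterative algorithms.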

### Example Spark Applications

![Log mining](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdaZIQmn72Ul6Epq8L%2FScreen%20Shot%202021-08-09%20at%202.29.35%20PM.png?alt=media\&token=e92b928c-8652-4cdf-bafe-f2ca7b01bb80)

![Logistic regression](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdagZZgm_wOw9_id6k%2FScreen%20Shot%202021-08-09%20at%202.30.11%20PM.png?alt=media\&token=00f98dc7-1518-4073-801e-a28abecd4c55)

### RDD Fault Tolerance

![RDDs maintain lineage information that can be used to reconstruct lost partitions](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdb6Q5zFpQOkdWf2EG%2FScreen%20Shot%202021-08-09%20at%202.31.59%20PM.png?alt=media\&token=bfee48c5-5edc-4aaa-b046-505646014b0d)
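
The figure's idea can be sketched in a few lines of plain Python (a toy illustration, not Spark internals): instead of checkpointing data, each partition records the chain of transformations (its lineage) that produced it, so a lost partition can be rebuilt by re-running that chain over just its slice of the stable input.

```python
# Toy lineage-based recovery: rebuild one lost partition, not the whole dataset.
base = list(range(100))                       # stands in for data in stable storage
lineage = [("filter", lambda x: x % 3 == 0),  # recorded transformation chain
           ("map",    lambda x: x * 10)]

def rebuild(partition_of, n_partitions, lost_idx):
    # Re-run the lineage only over the records belonging to the lost partition.
    part = [x for x in base if partition_of(x, n_partitions) == lost_idx]
    for op, f in lineage:
        part = [f(x) for x in part] if op == "map" else [x for x in part if f(x)]
    return part

# Partition by simple modulo; pretend partition 1 of 4 was lost:
recovered = rebuild(lambda x, n: x % n, 4, 1)
print(recovered)  # → [90, 210, 330, 450, 570, 690, 810, 930]
```

Because transformations are deterministic and RDDs are immutable, the recomputed partition is guaranteed to equal the lost one.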

## Big Data Distros (Distributions)

### Hortonworks

* Connected data strategy
  * HDP (Hortonworks Data Platform): Built on Apache Hadoop, an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data
  * HDF (Hortonworks DataFlow): Real-time data collection, curation, analysis, and delivery of data to and from any device, source, or system, whether on-premises or in the cloud

![Hortonworks Data Platform (HDP)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgddl1alNQL7se_yHRj%2FScreen%20Shot%202021-08-09%20at%202.43.36%20PM.png?alt=media\&token=9a414859-c408-4021-a668-a58f8fb6f250)

* HDP tools
  * Apache Zeppelin: Open web-based notebook that enables interactive data analytics
  * Apache Ambari: Open-source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters
* HDP data access
  * YARN: Data Operating System
    * MapReduce: Batch application framework for structured and unstructured data
    * Pig: Script ETL data pipelines, research on raw data, and iterative data processing
    * Hive: Interactive SQL queries over petabytes of data in Hadoop
    * HBase, Accumulo: Non-relational/NoSQL databases on top of HDFS
    * Storm (Stream): Distributed real-time processing of large volumes of high-velocity data
    * Solr (Search): Full-text search and near real-time indexing
    * Spark: In-memory processing engine for iterative and interactive workloads
  * Data management: HDFS
* HDF
  * Apache NiFi, Kafka, and Storm: Provide real-time dataflow management and streaming analytics

### Cloudera

![Cloudera Enterprise Data Hub (EDH)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdhYILVQW3M5Gw2ayD%2FScreen%20Shot%202021-08-09%20at%203.00.08%20PM.png?alt=media\&token=61ddb4ee-990c-40a0-948a-89773eb43d82)

### MapR

* Platforms for big data
  * MapReduce (a Hadoop-compatible implementation rewritten in C/C++)
  * NFS
  * Interactive SQL (Drill, Hive, Spark SQL, Impala)
  * MapR-DB
  * Search (Apache Solr)
  * Stream Processing (MapR Streams)

## HDFS

### HDFS

* HDFS properties
  * Synergistic w/ Hadoop
  * Massive throughput
  * Throughput scales with attached HDs
  * Proven in very large production clusters (Facebook, Yahoo)
  * Doesn't even pretend to be POSIX compliant
  * Optimized for reads, sequential writes, and appends
* How can we store data persistently? Ans: Distributed File System replicates files
* Distributed File System
  * Datanode Servers
    * A file is split into contiguous chunks (16-64MB), each of which is replicated (usually 2x or 3x)
    * Sends heartbeat and BlockReport to namenode
  * Replica placement (default rack-aware policy): first replica on the writer's node, second on a node in a different rack, third on a different node in that same remote rack, so a block survives both node and whole-rack failures
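
The rack-aware policy can be sketched as follows (a hedged sketch in plain Python; `place_replicas` and the cluster layout are illustrative, not the real HDFS code):

```python
# Toy rack-aware placement for a replication factor of 3: one replica on the
# writer's node, two on distinct nodes of a single different rack.
import random

def place_replicas(writer, nodes_by_rack):
    """nodes_by_rack: {rack_id: [node, ...]}; returns 3 chosen nodes."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    first = writer                                   # replica 1: local node
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [first, second, third]                    # replicas 2+3: remote rack

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", cluster))
```

Keeping two replicas in one remote rack trades a little rack-failure diversity for cheaper inter-rack write traffic, since only one copy crosses the rack boundary during the write pipeline.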

![HDFS architecture](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdkHcPaKzeFoNq8aVb%2FScreen%20Shot%202021-08-09%20at%203.12.06%20PM.png?alt=media\&token=f6d2d02c-696e-4718-9612-9dece5e62411)

* Master node (namenode in HDFS) stores metadata and may itself be replicated
  * Client libraries talk to the master to find which datanodes hold a file's chunks, then connect directly to those datanode servers to access the data
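
This two-step read path can be sketched as a toy simulation (illustrative names and in-memory dicts, not the real HDFS client API): a metadata lookup at the namenode, then a direct fetch from one of the replica datanodes.

```python
# Toy HDFS read path: metadata from the namenode, data from a datanode.
namenode = {  # filename -> ordered list of (block_id, [replica datanodes])
    "/logs/app.log": [("blk_1", ["n1", "n3"]), ("blk_2", ["n2", "n3"])],
}
datanodes = {  # datanode -> {block_id: bytes}
    "n1": {"blk_1": b"ERROR a\n"},
    "n2": {"blk_2": b"WARN b\n"},
    "n3": {"blk_1": b"ERROR a\n", "blk_2": b"WARN b\n"},
}

def read_file(path):
    data = b""
    for block_id, replicas in namenode[path]:  # 1) ask the master for metadata
        node = replicas[0]                     # 2) pick a replica (e.g. closest)
        data += datanodes[node][block_id]      # 3) read directly from datanode
    return data

print(read_file("/logs/app.log"))  # → b"ERROR a\nWARN b\n"
```

The key point is that the namenode never sits on the data path: it serves only small metadata answers, so throughput scales with the number of datanodes.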
* Replication pipelining: Data is pipelined from one datanode to the next in the background
* Staging: A client request to create a file does not reach the namenode immediately. Instead:
  1. The HDFS client caches the file data into a temporary local file
  2. Once the data reaches one HDFS block size, the client contacts the namenode
  3. The namenode inserts the filename into its namespace hierarchy and allocates a data block for it
  4. The namenode responds to the client with the identity of the datanode and the destinations of the block's replicas
  5. The client flushes the block from local storage to the datanode
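
Staging and pipelining together can be sketched as a toy simulation (plain Python with made-up names; the tiny block size is for illustration only): writes accumulate client-side, and only a full block triggers the namenode and the replica pipeline.

```python
# Toy staging + replication pipelining for an HDFS-style write path.
BLOCK_SIZE = 4  # bytes, tiny for illustration (real HDFS blocks are MBs)

stored = {"n1": [], "n2": [], "n3": []}  # what each datanode has received

def allocate_block(size):
    # Stands in for the namenode allocating a block and a replica pipeline.
    return ["n1", "n2", "n3"]

def pipeline_write(block, targets):
    # Each datanode stores the block, then forwards it to the next in line.
    for node in targets:
        stored[node].append(block)

buffer = b""                             # the client's temporary local file
for chunk in [b"ab", b"cd", b"ef"]:      # application writes arrive piecemeal
    buffer += chunk
    while len(buffer) >= BLOCK_SIZE:     # only a full block contacts the namenode
        block, buffer = buffer[:BLOCK_SIZE], buffer[BLOCK_SIZE:]
        pipeline_write(block, allocate_block(len(block)))

print(stored["n3"], buffer)  # → [b'abcd'] b'ef' (trailing bytes still staged)
```

Staging keeps small writes off the network and the namenode; pipelining means the client sends each block once, with datanodes relaying it onward.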

### YARN and Mesos

* Mesos: Built to be a scalable global resource manager for the entire datacenter

![Mesos architecture](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdoJo_AbvEfUWf8ic6%2FScreen%20Shot%202021-08-09%20at%203.29.40%20PM.png?alt=media\&token=3c85ca5d-6b82-44b2-838f-9fa4b9e9cb23)

![Mesos resource offer mechanism](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdoS6nygcz_vnVPMAn%2FScreen%20Shot%202021-08-09%20at%203.30.18%20PM.png?alt=media\&token=48fb4e1e-ca32-48eb-b92e-759ca8d0b86e)
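
The offer mechanism in the figure can be sketched as a toy model (not the Mesos API; the framework and field names are made up): the master *offers* resources to frameworks, and each framework, not the master, decides which offers to accept for its tasks.

```python
# Toy Mesos-style resource offers: master proposes, frameworks dispose.
offers = [{"node": "n1", "cpus": 4, "mem_gb": 8},
          {"node": "n2", "cpus": 2, "mem_gb": 32}]

def spark_like_framework(offer):
    # This framework only wants CPU-heavy offers; declined offers go back
    # to the master, which can re-offer them to another framework.
    if offer["cpus"] >= 4:
        return {"task": "spark-stage-0", "on": offer["node"]}
    return None  # decline

launched = [t for t in map(spark_like_framework, offers) if t]
print(launched)  # → [{'task': 'spark-stage-0', 'on': 'n1'}]
```

Pushing the accept/decline decision into the frameworks is what lets one Mesos master serve many heterogeneous schedulers without understanding their scheduling logic.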

* YARN: Created out of the necessity to scale Hadoop

![YARN ResourceManager](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdnf8ei7n7-SoybUkJ%2FScreen%20Shot%202021-08-09%20at%203.26.52%20PM.png?alt=media\&token=c6294dc4-5649-4d45-842a-b3f8ecc6e64f)

![The insides of YARN](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdntfe7Jz9S8-8eCFx%2FScreen%20Shot%202021-08-09%20at%203.27.53%20PM.png?alt=media\&token=0485a881-b0fb-4416-9133-a77486931809)
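
The control flow in the YARN figures can be sketched as a toy simulation (illustrative names, not the YARN API): the ResourceManager hands out containers, a per-application ApplicationMaster negotiates for them, and NodeManagers are where the containers actually run.

```python
# Toy YARN flow: ApplicationMaster asks the ResourceManager for containers.
free = {"nm1": 3, "nm2": 2}  # NodeManager -> free container slots

def resource_manager_grant(n):
    # The (simulated) ResourceManager grants up to n containers cluster-wide.
    grants = []
    for nm in free:
        while free[nm] > 0 and len(grants) < n:
            free[nm] -= 1
            grants.append(nm)
    return grants

def application_master(num_tasks):
    # A per-app master negotiates containers, then assigns its tasks to them.
    containers = resource_manager_grant(num_tasks)
    return [f"task-{i} on {nm}" for i, nm in enumerate(containers)]

tasks = application_master(4)
print(tasks)  # → ['task-0 on nm1', 'task-1 on nm1', 'task-2 on nm1', 'task-3 on nm2']
```

Unlike Mesos's offer model, here the application *requests* resources and the ResourceManager decides the placement, which is the main architectural contrast between the two.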

* Project Myriad: Combines Mesos and YARN
  * Mesos framework and a YARN scheduler that enables Mesos to manage YARN resource requests

![Project Myriad architecture](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdouk_cyV2Nu2wOBxl%2FScreen%20Shot%202021-08-09%20at%203.32.20%20PM.png?alt=media\&token=e2fda0b0-4ea2-4b70-9044-f9953e943094)
