# 4.1 Spark, Hortonworks, HDFS, CAP

## Spark

### Apache Spark

* Motivation: Traditional MapReduce & classical parallel runtimes cannot solve iterative algorithms efficiently
  * Hadoop: Repeated data access to HDFS, no optimizations to data caching & data transfers
  * MPI: No natural support for fault tolerance; programming interface is complicated
* Apache Spark: Extend the MapReduce model to better support two common classes of analytics apps:
  * Iterative algorithms (ML, graphs)
  * Interactive data mining
* Why are current frameworks not working?
  * Most cluster programming models use acyclic data flow (from stable storage to stable storage)
  * Acyclic data flow is inefficient for apps that repeatedly reuse a working set of data
* Solution: **Resilient Distributed Datasets (RDDs)**
  * Advantages
    * Allow apps to keep working sets in memory for efficient reuse
    * Retains the attractive properties of MapReduce (fault tolerance, data locality, scalability)
    * Supports a wide range of applications
  * Properties
    * Immutable, partitioned collections of objects
    * Created through parallel transformations (map, filter, groupBy, join) on data in stable storage
    * Can be cached for efficient reuse

### Example Spark Applications

![Log mining](/files/-MgdaZIQmn72Ul6Epq8L)

![Logistic regression](/files/-MgdagZZgm_wOw9_id6k)

### RDD Fault Tolerance

![RDDs maintain lineage information that can be used to reconstruct lost partitions](/files/-Mgdb6Q5zFpQOkdWf2EG)

## Big Data Distros (Distributions)

### Hortonworks

* Connected data strategy
  * HDP: Apache Hadoop is an open-source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data
  * HDF: Real-time data collection, curation, analysis, and delivery of data to and from any device, source or system, either on-premise and in the cloud

![Hortonworks Data Platform (HDP)](/files/-Mgddl1alNQL7se_yHRj)

* HDP tools
  * Apache Zeppelin: Open web-based notebook that enables interactive data analytics
  * Apache Ambari: Source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters
* HDP data access
  * YARN: Data Operating System
    * MapReduce: Batch application framework for structured and unstructured data
    * Pig: Script ETL data pipelines, research on raw data, and iterative data processing
    * Hive: Interactive SQL queries over petabytes of data in Hadoop
    * Hbase Accumulo: Non-relational/NoSQL database on top of HDFS
    * Storm (Stream): Distributed real-time large volumes of high-velocity data
    * Solr (Search): Full-text search and near real-time indexing
    * Spark: In-memory
  * Data management: HDFS
* HDF
  * Apache NiFi, Kafka, and Storm: Provide real-time dataflow management and streaming analytics

### Cloudera

![Cloudera Enterprise Data Hub (EDH)](/files/-MgdhYILVQW3M5Gw2ayD)

### MapR

* Platforms for big data
  * MapReduce (Hadoop written in C/C++)
  * NFS
  * Interactive SQL (Drill, Hive Spark SQL, Impala)
  * MapR-DB
  * Search (Apache Solr)
  * Stream Processing (MapR Streams)

## HDFS

### HDFS

* HDFS properties
  * Synergistic w/ Hadoop
  * Massive throughput
  * Throughput scales with attached HDs
  * Have seen very large production clusters (Facebook, Yahoo)
  * Doesn't even pretend to be POSIX compliant
  * Optimized for reads, sequential writes, and appends
* How can we store data persistently? Ans: Distributed File System replicates files
* Distributed File System
  * Datanode Servers
    * A file is split into contiguous chunks (16-64MB), each of which is replicated (usually 2x or 3x)
    * Sends heartbeat and BlockReport to namenode
  * Replicas are placed: one on a node in a local rack, one on a different node in the local rack, and one on a node in a different rack (lots of back-ups)

![HDFS architecture](/files/-MgdkHcPaKzeFoNq8aVb)

* Master node (namenode in HDFS) stores metadata, and might be replicated
  * Client libraries for file accesses talk to master to find datanode chunk, and then connect directly to datanode servers to access data
* Replication pipelining: Data is pipelined from datanode to the next in the background
* Staging: A client request to create a file does not reach namenode immediately. Instead, HDFS client caches the data into a temporary file -> once the data size reaches a HDFS block size, the client contacts the namenode -> namenode inserts the filename into its hierarchy and allocates a data block for it -> namenode responds to the client with the identity of the datanode and the destinations of the replicas/datanodes for the block -> client flushes from local memory

### YARN and Mesos

* Mesos: Built to be a scalable global resource manager for the entire datacenter

![Mesos architecture](/files/-MgdoJo_AbvEfUWf8ic6)

![Mesos resource offer mechanism](/files/-MgdoS6nygcz_vnVPMAn)

* YARN: Created out of the necessity to scale Hadoop

![YARN ResourceManager](/files/-Mgdnf8ei7n7-SoybUkJ)

![The insides of YARN](/files/-Mgdntfe7Jz9S8-8eCFx)

* Project myriad: Composites Mesos and YARN
  * Mesos framework and a YARN scheduler that enables Mesos to manage YARN resource requests

![](/files/-Mgdouk_cyV2Nu2wOBxl)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://blog.ruipan.xyz/earlier-readings-and-notes/cloud-computing-course-notes/4.1-spark-hortonworks-hdfs-cap.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
