# 4.1 Spark, Hortonworks, HDFS, CAP

## Spark

### Apache Spark

* Motivation: Traditional MapReduce and classical parallel runtimes cannot run iterative algorithms efficiently
  * Hadoop: Repeated data access to HDFS on every iteration; no optimizations for data caching & data transfers
  * MPI: No natural support for fault tolerance; programming interface is complicated
* Apache Spark: Extends the MapReduce model to better support two common classes of analytics apps:
  * Iterative algorithms (ML, graphs)
  * Interactive data mining
* Why are current frameworks not working?
  * Most cluster programming models use acyclic data flow (from stable storage to stable storage)
  * Acyclic data flow is inefficient for apps that repeatedly reuse a working set of data
* Solution: **Resilient Distributed Datasets (RDDs)**
  * Advantages
    * Allow apps to keep working sets in memory for efficient reuse
    * Retains the attractive properties of MapReduce (fault tolerance, data locality, scalability)
    * Supports a wide range of applications
  * Properties
    * Immutable, partitioned collections of objects
    * Created through parallel transformations (map, filter, groupBy, join) on data in stable storage
    * Can be cached for efficient reuse
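
The properties above can be illustrated with a toy sketch in plain Python (this is not Spark or its API; `MiniRDD` and `parallelize` are made-up illustrative names): each dataset is immutable, built lazily from transformations on its parent, and optionally kept in memory after the first evaluation.

```python
# Toy model of the RDD idea: immutable, lazily evaluated, cacheable.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute   # closure that (re)builds this dataset
        self._keep = False        # set by cache()
        self._cache = None        # in-memory copy once materialized

    def map(self, f):
        # Transformations return a NEW dataset; parents are never mutated.
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        if self._cache is not None:      # efficient reuse of a working set
            return self._cache
        data = self._compute()
        if self._keep:
            self._cache = data
        return data

def parallelize(data):
    # Stands in for creating an RDD from data in stable storage.
    return MiniRDD(lambda: list(data))

nums = parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0).cache()
squares = evens.map(lambda x: x * x)
print(squares.collect())  # → [0, 4, 16, 36, 64]; 'evens' is now cached
```

Any later transformation built on `evens` reads the cached copy instead of recomputing from the source, which is exactly the win for iterative algorithms.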

### Example Spark Applications

![Log mining](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdaZIQmn72Ul6Epq8L%2FScreen%20Shot%202021-08-09%20at%202.29.35%20PM.png?alt=media\&token=e92b928c-8652-4cdf-bafe-f2ca7b01bb80)

![Logistic regression](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdagZZgm_wOw9_id6k%2FScreen%20Shot%202021-08-09%20at%202.30.11%20PM.png?alt=media\&token=00f98dc7-1518-4073-801e-a28abecd4c55)

### RDD Fault Tolerance

![RDDs maintain lineage information that can be used to reconstruct lost partitions](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdb6Q5zFpQOkdWf2EG%2FScreen%20Shot%202021-08-09%20at%202.31.59%20PM.png?alt=media\&token=bfee48c5-5edc-4aaa-b046-505646014b0d)
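
The figure's idea can be sketched in a few lines of plain Python (a toy illustration, not Spark internals): instead of checkpointing data, each partition records the chain of transformations (its lineage) that produced it, so a lost partition can be rebuilt by re-running that chain over just its slice of the stable input.

```python
# Toy lineage-based recovery: rebuild one lost partition, not the whole dataset.
base = list(range(100))                       # stands in for data in stable storage
lineage = [("filter", lambda x: x % 3 == 0),  # recorded transformation chain
           ("map",    lambda x: x * 10)]

def rebuild(partition_of, n_partitions, lost_idx):
    # Re-run the lineage only over the records belonging to the lost partition.
    part = [x for x in base if partition_of(x, n_partitions) == lost_idx]
    for op, f in lineage:
        part = [f(x) for x in part] if op == "map" else [x for x in part if f(x)]
    return part

# Partition by simple modulo; pretend partition 1 of 4 was lost:
recovered = rebuild(lambda x, n: x % n, 4, 1)
print(recovered)  # → [90, 210, 330, 450, 570, 690, 810, 930]
```

Because transformations are deterministic and RDDs are immutable, the recomputed partition is guaranteed to equal the lost one.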

## Big Data Distros (Distributions)

### Hortonworks

* Connected data strategy
  * HDP (Hortonworks Data Platform): Built on Apache Hadoop, an open-source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data
  * HDF (Hortonworks DataFlow): Real-time data collection, curation, analysis, and delivery of data to and from any device, source, or system, whether on-premises or in the cloud

![Hortonworks Data Platform (HDP)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgddl1alNQL7se_yHRj%2FScreen%20Shot%202021-08-09%20at%202.43.36%20PM.png?alt=media\&token=9a414859-c408-4021-a668-a58f8fb6f250)

* HDP tools
  * Apache Zeppelin: Open web-based notebook that enables interactive data analytics
  * Apache Ambari: Open-source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters
* HDP data access
  * YARN: Data Operating System
    * MapReduce: Batch application framework for structured and unstructured data
    * Pig: Script ETL data pipelines, research on raw data, and iterative data processing
    * Hive: Interactive SQL queries over petabytes of data in Hadoop
    * HBase, Accumulo: Non-relational/NoSQL databases on top of HDFS
    * Storm (Stream): Distributed real-time processing of large volumes of high-velocity data
    * Solr (Search): Full-text search and near real-time indexing
    * Spark: In-memory processing engine for iterative and interactive workloads
  * Data management: HDFS
* HDF
  * Apache NiFi, Kafka, and Storm: Provide real-time dataflow management and streaming analytics

### Cloudera

![Cloudera Enterprise Data Hub (EDH)](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdhYILVQW3M5Gw2ayD%2FScreen%20Shot%202021-08-09%20at%203.00.08%20PM.png?alt=media\&token=61ddb4ee-990c-40a0-948a-89773eb43d82)

### MapR

* Platforms for big data
  * MapReduce (a Hadoop-compatible implementation rewritten in C/C++)
  * NFS
  * Interactive SQL (Drill, Hive, Spark SQL, Impala)
  * MapR-DB
  * Search (Apache Solr)
  * Stream Processing (MapR Streams)

## HDFS

### HDFS

* HDFS properties
  * Synergistic w/ Hadoop
  * Massive throughput
  * Throughput scales with attached HDs
  * Proven in very large production clusters (Facebook, Yahoo)
  * Doesn't even pretend to be POSIX compliant
  * Optimized for reads, sequential writes, and appends
* How can we store data persistently? Ans: Distributed File System replicates files
* Distributed File System
  * Datanode Servers
    * A file is split into contiguous chunks (16-64MB), each of which is replicated (usually 2x or 3x)
    * Sends heartbeat and BlockReport to namenode
  * Replica placement (default rack-aware policy): first replica on the writer's node, second on a node in a different rack, third on a different node in that same remote rack, so a block survives both node and whole-rack failures
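
The rack-aware policy can be sketched as follows (a hedged sketch in plain Python; `place_replicas` and the cluster layout are illustrative, not the real HDFS code):

```python
# Toy rack-aware placement for a replication factor of 3: one replica on the
# writer's node, two on distinct nodes of a single different rack.
import random

def place_replicas(writer, nodes_by_rack):
    """nodes_by_rack: {rack_id: [node, ...]}; returns 3 chosen nodes."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    first = writer                                   # replica 1: local node
    remote_rack = random.choice([r for r in nodes_by_rack if r != local_rack])
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [first, second, third]                    # replicas 2+3: remote rack

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", cluster))
```

Keeping two replicas in one remote rack trades a little rack-failure diversity for cheaper inter-rack write traffic, since only one copy crosses the rack boundary during the write pipeline.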

![HDFS architecture](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdkHcPaKzeFoNq8aVb%2FScreen%20Shot%202021-08-09%20at%203.12.06%20PM.png?alt=media\&token=f6d2d02c-696e-4718-9612-9dece5e62411)

* Master node (namenode in HDFS) stores metadata and may itself be replicated
  * Client libraries talk to the master to find which datanodes hold a file's chunks, then connect directly to those datanode servers to access the data
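
This two-step read path can be sketched as a toy simulation (illustrative names and in-memory dicts, not the real HDFS client API): a metadata lookup at the namenode, then a direct fetch from one of the replica datanodes.

```python
# Toy HDFS read path: metadata from the namenode, data from a datanode.
namenode = {  # filename -> ordered list of (block_id, [replica datanodes])
    "/logs/app.log": [("blk_1", ["n1", "n3"]), ("blk_2", ["n2", "n3"])],
}
datanodes = {  # datanode -> {block_id: bytes}
    "n1": {"blk_1": b"ERROR a\n"},
    "n2": {"blk_2": b"WARN b\n"},
    "n3": {"blk_1": b"ERROR a\n", "blk_2": b"WARN b\n"},
}

def read_file(path):
    data = b""
    for block_id, replicas in namenode[path]:  # 1) ask the master for metadata
        node = replicas[0]                     # 2) pick a replica (e.g. closest)
        data += datanodes[node][block_id]      # 3) read directly from datanode
    return data

print(read_file("/logs/app.log"))  # → b"ERROR a\nWARN b\n"
```

The key point is that the namenode never sits on the data path: it serves only small metadata answers, so throughput scales with the number of datanodes.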
* Replication pipelining: Data is pipelined from one datanode to the next in the background
* Staging: A client request to create a file does not reach the namenode immediately. Instead:
  1. The HDFS client caches the file data into a temporary local file
  2. Once the data reaches one HDFS block size, the client contacts the namenode
  3. The namenode inserts the filename into its namespace hierarchy and allocates a data block for it
  4. The namenode responds to the client with the identity of the datanode and the destinations of the block's replicas
  5. The client flushes the block from local storage to the datanode
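
Staging and pipelining together can be sketched as a toy simulation (plain Python with made-up names; the tiny block size is for illustration only): writes accumulate client-side, and only a full block triggers the namenode and the replica pipeline.

```python
# Toy staging + replication pipelining for an HDFS-style write path.
BLOCK_SIZE = 4  # bytes, tiny for illustration (real HDFS blocks are MBs)

stored = {"n1": [], "n2": [], "n3": []}  # what each datanode has received

def allocate_block(size):
    # Stands in for the namenode allocating a block and a replica pipeline.
    return ["n1", "n2", "n3"]

def pipeline_write(block, targets):
    # Each datanode stores the block, then forwards it to the next in line.
    for node in targets:
        stored[node].append(block)

buffer = b""                             # the client's temporary local file
for chunk in [b"ab", b"cd", b"ef"]:      # application writes arrive piecemeal
    buffer += chunk
    while len(buffer) >= BLOCK_SIZE:     # only a full block contacts the namenode
        block, buffer = buffer[:BLOCK_SIZE], buffer[BLOCK_SIZE:]
        pipeline_write(block, allocate_block(len(block)))

print(stored["n3"], buffer)  # → [b'abcd'] b'ef' (trailing bytes still staged)
```

Staging keeps small writes off the network and the namenode; pipelining means the client sends each block once, with datanodes relaying it onward.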

### YARN and Mesos

* Mesos: Built to be a scalable global resource manager for the entire datacenter

![Mesos architecture](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdoJo_AbvEfUWf8ic6%2FScreen%20Shot%202021-08-09%20at%203.29.40%20PM.png?alt=media\&token=3c85ca5d-6b82-44b2-838f-9fa4b9e9cb23)

![Mesos resource offer mechanism](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-MgdoS6nygcz_vnVPMAn%2FScreen%20Shot%202021-08-09%20at%203.30.18%20PM.png?alt=media\&token=48fb4e1e-ca32-48eb-b92e-759ca8d0b86e)
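
The offer mechanism in the figure can be sketched as a toy model (not the Mesos API; the framework and field names are made up): the master *offers* resources to frameworks, and each framework, not the master, decides which offers to accept for its tasks.

```python
# Toy Mesos-style resource offers: master proposes, frameworks dispose.
offers = [{"node": "n1", "cpus": 4, "mem_gb": 8},
          {"node": "n2", "cpus": 2, "mem_gb": 32}]

def spark_like_framework(offer):
    # This framework only wants CPU-heavy offers; declined offers go back
    # to the master, which can re-offer them to another framework.
    if offer["cpus"] >= 4:
        return {"task": "spark-stage-0", "on": offer["node"]}
    return None  # decline

launched = [t for t in map(spark_like_framework, offers) if t]
print(launched)  # → [{'task': 'spark-stage-0', 'on': 'n1'}]
```

Pushing the accept/decline decision into the frameworks is what lets one Mesos master serve many heterogeneous schedulers without understanding their scheduling logic.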

* YARN: Created out of the necessity to scale Hadoop

![YARN ResourceManager](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdnf8ei7n7-SoybUkJ%2FScreen%20Shot%202021-08-09%20at%203.26.52%20PM.png?alt=media\&token=c6294dc4-5649-4d45-842a-b3f8ecc6e64f)

![The insides of YARN](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdntfe7Jz9S8-8eCFx%2FScreen%20Shot%202021-08-09%20at%203.27.53%20PM.png?alt=media\&token=0485a881-b0fb-4416-9133-a77486931809)
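
The control flow in the YARN figures can be sketched as a toy simulation (illustrative names, not the YARN API): the ResourceManager hands out containers, a per-application ApplicationMaster negotiates for them, and NodeManagers are where the containers actually run.

```python
# Toy YARN flow: ApplicationMaster asks the ResourceManager for containers.
free = {"nm1": 3, "nm2": 2}  # NodeManager -> free container slots

def resource_manager_grant(n):
    # The (simulated) ResourceManager grants up to n containers cluster-wide.
    grants = []
    for nm in free:
        while free[nm] > 0 and len(grants) < n:
            free[nm] -= 1
            grants.append(nm)
    return grants

def application_master(num_tasks):
    # A per-app master negotiates containers, then assigns its tasks to them.
    containers = resource_manager_grant(num_tasks)
    return [f"task-{i} on {nm}" for i, nm in enumerate(containers)]

tasks = application_master(4)
print(tasks)  # → ['task-0 on nm1', 'task-1 on nm1', 'task-2 on nm1', 'task-3 on nm2']
```

Unlike Mesos's offer model, here the application *requests* resources and the ResourceManager decides the placement, which is the main architectural contrast between the two.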

* Project Myriad: Combines Mesos and YARN
  * Mesos framework and a YARN scheduler that enables Mesos to manage YARN resource requests

![Project Myriad architecture](https://1313833672-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MMTslgmrrtRXvxD2lk9%2F-MgdUlOFHBAWBxgoqmhQ%2F-Mgdouk_cyV2Nu2wOBxl%2FScreen%20Shot%202021-08-09%20at%203.32.20%20PM.png?alt=media\&token=e2fda0b0-4ea2-4b70-9044-f9953e943094)
