Rui's Blog

4.1 Spark, Hortonworks, HDFS, CAP


Apache Spark

  • Motivation: Traditional MapReduce & classical parallel runtimes cannot solve iterative algorithms efficiently
    • Hadoop: Repeated data access to HDFS, no optimizations to data caching & data transfers
    • MPI: No natural support for fault tolerance; programming interface is complicated
  • Apache Spark: Extend the MapReduce model to better support two common classes of analytics apps:
    • Iterative algorithms (ML, graphs)
    • Interactive data mining
  • Why are current frameworks not working?
    • Most cluster programming models use acyclic data flow (from stable storage to stable storage)
    • Acyclic data flow is inefficient for apps that repeatedly reuse a working set of data
  • Solution: Resilient Distributed Datasets (RDDs)
    • Advantages
      • Allow apps to keep working sets in memory for efficient reuse
      • Retains the attractive properties of MapReduce (fault tolerance, data locality, scalability)
      • Supports a wide range of applications
    • Properties
      • Immutable, partitioned collections of objects
      • Created through parallel transformations (map, filter, groupBy, join) on data in stable storage
      • Can be cached for efficient reuse

Example Spark Applications

Log mining
Logistic regression

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions

Big Data Distros (Distributions)


  • Connected data strategy
    • HDP: Apache Hadoop is an open-source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data
    • HDF: Real-time data collection, curation, analysis, and delivery of data to and from any device, source or system, either on-premise and in the cloud
Hortonworks Data Platform (HDP)
  • HDP tools
    • Apache Zeppelin: Open web-based notebook that enables interactive data analytics
    • Apache Ambari: Source management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters
  • HDP data access
    • YARN: Data Operating System
      • MapReduce: Batch application framework for structured and unstructured data
      • Pig: Script ETL data pipelines, research on raw data, and iterative data processing
      • Hive: Interactive SQL queries over petabytes of data in Hadoop
      • Hbase Accumulo: Non-relational/NoSQL database on top of HDFS
      • Storm (Stream): Distributed real-time large volumes of high-velocity data
      • Solr (Search): Full-text search and near real-time indexing
      • Spark: In-memory
    • Data management: HDFS
  • HDF
    • Apache NiFi, Kafka, and Storm: Provide real-time dataflow management and streaming analytics


Cloudera Enterprise Data Hub (EDH)


  • Platforms for big data
    • MapReduce (Hadoop written in C/C++)
    • NFS
    • Interactive SQL (Drill, Hive Spark SQL, Impala)
    • MapR-DB
    • Search (Apache Solr)
    • Stream Processing (MapR Streams)



  • HDFS properties
    • Synergistic w/ Hadoop
    • Massive throughput
    • Throughput scales with attached HDs
    • Have seen very large production clusters (Facebook, Yahoo)
    • Doesn't even pretend to be POSIX compliant
    • Optimized for reads, sequential writes, and appends
  • How can we store data persistently? Ans: Distributed File System replicates files
  • Distributed File System
    • Datanode Servers
      • A file is split into contiguous chunks (16-64MB), each of which is replicated (usually 2x or 3x)
      • Sends heartbeat and BlockReport to namenode
    • Replicas are placed: one on a node in a local rack, one on a different node in the local rack, and one on a node in a different rack (lots of back-ups)
HDFS architecture
  • Master node (namenode in HDFS) stores metadata, and might be replicated
    • Client libraries for file accesses talk to master to find datanode chunk, and then connect directly to datanode servers to access data
  • Replication pipelining: Data is pipelined from datanode to the next in the background
  • Staging: A client request to create a file does not reach namenode immediately. Instead, HDFS client caches the data into a temporary file -> once the data size reaches a HDFS block size, the client contacts the namenode -> namenode inserts the filename into its hierarchy and allocates a data block for it -> namenode responds to the client with the identity of the datanode and the destinations of the replicas/datanodes for the block -> client flushes from local memory

YARN and Mesos

  • Mesos: Built to be a scalable global resource manager for the entire datacenter
Mesos architecture
Mesos resource offer mechanism
  • YARN: Created out of the necessity to scale Hadoop
YARN ResourceManager
The insides of YARN
  • Project myriad: Composites Mesos and YARN
    • Mesos framework and a YARN scheduler that enables Mesos to manage YARN resource requests