4.2 Large Scale Data Storage

MapReduce

Motivation

  • Challenges w/ traditional programming models (MPI)

    • Deadlock is possible: Blocking communication can cause deadlock

    • Large overhead from communication mismanagement

    • Load imbalance

    • Hard to code

  • Challenges with commodity clusters

    • Web datasets can be very large

    • Standard architectures are emerging -- how to organize computations on this storage?

  • Solutions

    • Use distributed storage

      • 6-24 disks attached to a blade, 32-64 blades in a rack connected by Ethernet

    • Push computations down to storage

  • Stable storage becomes a first order problem. Answer: Distributed File System

    • Typical usage pattern

      • Huge files (100s of GB to TB)

      • Data is rarely updated in place

      • Reads and appends are common

TODO

CAP Theorem & Eventual Consistency

CAP & Eventual Consistency

  • Consistency models for distributed systems: ACID, BASE, Paxos

  • CAP theorem (Eric Brewer, 2002; started as conjecture, proven in 2002?): You can have just two of Consistency, Availability, and Partition Tolerance

    • Consistency: All nodes see the same data at the same time

    • Availability: A guarantee that every request receives a response about whether it was successful or failed

    • Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system

  • Data centers should weaken consistency for faster response

ACID & BASE

  • ACID

    • Atomicity: Even if transactions have multiple operations, does them to completion (commit) or rolls back so that they leave no effect (abort)

    • Consistency: A transaction that runs on a correct database leaves it in a correct/consistent state

    • Isolation: It looks as if each transaction ran all by itself. Basically says "we'll hide any concurrency"

    • Durability: Once a transaction commits, updates can't be lost or rolled back

Zookeeper & Paxos

Distributed Key-Value Store

Scalable Databases

Publish-Subscribe Queues (Kafka)

Last updated