4.2 Large Scale Data Storage
MapReduce
Motivation
Challenges w/ traditional programming models (MPI)
Deadlock is possible: Blocking communication can cause deadlock
Large overhead from communication mismanagement
Load imbalance
Hard to code
Challenges with commodity clusters
Web datasets can be very large
Standard architectures are emerging -- how to organize computations on this storage?
Solutions
Use distributed storage
6-24 disks attached to a blade, 32-64 blades in a rack connected by Ethernet
Push computations down to storage
Stable storage becomes a first order problem. Answer: Distributed File System
Typical usage pattern
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common
TODO
CAP Theorem & Eventual Consistency
CAP & Eventual Consistency
Consistency models for distributed systems: ACID, BASE, Paxos
CAP theorem (Eric Brewer, 2002; started as conjecture, proven in 2002?): You can have just two of Consistency, Availability, and Partition Tolerance
Consistency: All nodes see the same data at the same time
Availability: A guarantee that every request receives a response about whether it was successful or failed
Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system
Data centers should weaken consistency for faster response
ACID & BASE
ACID
Atomicity: Even if transactions have multiple operations, does them to completion (commit) or rolls back so that they leave no effect (abort)
Consistency: A transaction that runs on a correct database leaves it in a correct/consistent state
Isolation: It looks as if each transaction ran all by itself. Basically says "we'll hide any concurrency"
Durability: Once a transaction commits, updates can't be lost or rolled back
Zookeeper & Paxos
Distributed Key-Value Store
Scalable Databases
Publish-Subscribe Queues (Kafka)
Last updated