# 4.2 Large Scale Data Storage

## MapReduce

### Motivation

* Challenges w/ traditional programming models (MPI)
  * Deadlock is possible: Blocking communication can cause deadlock
  * Large overhead from communication mismanagement
  * Load imbalance
  * Hard to code
* Challenges with commodity clusters
  * Web datasets can be very large
  * Standard architectures are emerging -- how to organize computations on this storage?
* Solutions
  * Use distributed storage
    * 6-24 disks attached to a blade, 32-64 blades in a rack connected by Ethernet
  * Push computations down to storage
* Stable storage becomes a first order problem. Answer: Distributed File System
  * Typical usage pattern
    * Huge files (100s of GB to TB)
    * Data is rarely updated in place
    * Reads and appends are common

### TODO

## CAP Theorem & Eventual Consistency

### CAP & Eventual Consistency

* Consistency models for distributed systems: ACID, BASE, Paxos
* CAP theorem (Eric Brewer, 2002; started as conjecture, proven in 2002?): You can have just two of Consistency, Availability, and Partition Tolerance
  * Consistency: All nodes see the same data at the same time
  * Availability: A guarantee that every request receives a response about whether it was successful or failed
  * Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system
* Data centers should weaken consistency for faster response

### ACID & BASE

* ACID
  * Atomicity: Even if transactions have multiple operations, does them to completion (commit) or rolls back so that they leave no effect (abort)
  * Consistency: A transaction that runs on a correct database leaves it in a correct/consistent state
  * Isolation: It looks as if each transaction ran all by itself. Basically says "we'll hide any concurrency"
  * Durability: Once a transaction commits, updates can't be lost or rolled back

### Zookeeper & Paxos

## Distributed Key-Value Store

## Scalable Databases

## Publish-Subscribe Queues (Kafka)
