# 4.2 Large Scale Data Storage

## MapReduce

### Motivation

* Challenges w/ traditional programming models (MPI)
  * Deadlock is possible: Blocking communication can cause deadlock
  * Large overhead from communication mismanagement
  * Load imbalance
  * Hard to code
* Challenges with commodity clusters
  * Web datasets can be very large
  * Standard architectures are emerging -- how to organize computations on this storage?
* Solutions
  * Use distributed storage
    * 6-24 disks attached to a blade, 32-64 blades in a rack connected by Ethernet
  * Push computations down to storage
* Stable storage becomes a first order problem. Answer: Distributed File System
  * Typical usage pattern
    * Huge files (100s of GB to TB)
    * Data is rarely updated in place
    * Reads and appends are common

### TODO

## CAP Theorem & Eventual Consistency

### CAP & Eventual Consistency

* Consistency models for distributed systems: ACID, BASE, Paxos
* CAP theorem (Eric Brewer, 2002; started as conjecture, proven in 2002?): You can have just two of Consistency, Availability, and Partition Tolerance
  * Consistency: All nodes see the same data at the same time
  * Availability: A guarantee that every request receives a response about whether it was successful or failed
  * Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system
* Data centers should weaken consistency for faster response

### ACID & BASE

* ACID
  * Atomicity: Even if transactions have multiple operations, does them to completion (commit) or rolls back so that they leave no effect (abort)
  * Consistency: A transaction that runs on a correct database leaves it in a correct/consistent state
  * Isolation: It looks as if each transaction ran all by itself. Basically says "we'll hide any concurrency"
  * Durability: Once a transaction commits, updates can't be lost or rolled back

### Zookeeper & Paxos

## Distributed Key-Value Store

## Scalable Databases

## Publish-Subscribe Queues (Kafka)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://blog.ruipan.xyz/earlier-readings-and-notes/cloud-computing-course-notes/4.2-large-scale-data-storage.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
