# 4.2 Large Scale Data Storage

## MapReduce

### Motivation

* Challenges with traditional programming models (MPI)
  * Blocking communication can cause deadlock
  * Large overhead from communication mismanagement
  * Load imbalance
  * Hard to code
* Challenges with commodity clusters
  * Web datasets can be very large
  * Standard architectures are emerging -- how to organize computations on this storage?
* Solutions
  * Use distributed storage
    * 6-24 disks attached to a blade, 32-64 blades in a rack connected by Ethernet
  * Push computations down to storage
* Stable storage becomes a first-order problem. Answer: a distributed file system
  * Typical usage pattern
    * Huge files (100s of GB to TB)
    * Data is rarely updated in place
    * Reads and appends are common

### TODO

## CAP Theorem & Eventual Consistency

### CAP & Eventual Consistency

* Consistency models for distributed systems: ACID, BASE, Paxos
* CAP theorem (Eric Brewer, 2000; started as a conjecture, proven by Gilbert and Lynch in 2002): You can have at most two of Consistency, Availability, and Partition Tolerance
  * Consistency: All nodes see the same data at the same time
  * Availability: A guarantee that every request receives a response about whether it was successful or failed
  * Partition tolerance: The system continues to operate despite arbitrary message loss or failure of part of the system
* Data centers should weaken consistency for faster response
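
The trade-off above can be illustrated with a toy model. This is a minimal sketch, not a real replication protocol: `TwoNodeStore`, its `mode` flag, and the `partitioned` switch are hypothetical names made up for this example. During a partition, a "CP" store rejects writes (giving up availability), while an "AP" store accepts them locally (giving up consistency).

```python
class Replica:
    """One copy of a single-value register."""
    def __init__(self, name):
        self.name = name
        self.value = None


class TwoNodeStore:
    """Toy two-replica register; `mode` is 'CP' or 'AP' (hypothetical labels)."""
    def __init__(self, mode):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True means a and b cannot exchange messages

    def write(self, replica, value):
        """Return True if the write was accepted, False if it was refused."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse rather than let the replicas diverge: availability is lost.
                return False
            # AP: accept on the reachable replica only; replicas may now disagree.
            replica.value = value
            return True
        # No partition: replicate synchronously to both copies.
        replica.value = value
        other.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value


cp = TwoNodeStore("CP")
cp.write(cp.a, 1)
cp.partitioned = True
print(cp.write(cp.a, 2), cp.consistent())  # write refused, replicas still agree

ap = TwoNodeStore("AP")
ap.write(ap.a, 1)
ap.partitioned = True
print(ap.write(ap.a, 2), ap.consistent())  # write accepted, replicas diverge
```

An eventually consistent system would add a reconciliation step once the partition heals; that step is left out of this sketch.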

### ACID & BASE

* ACID
  * Atomicity: Even if a transaction has multiple operations, it either runs them all to completion (commit) or rolls back so that they leave no effect (abort)
  * Consistency: A transaction that runs on a correct database leaves it in a correct/consistent state
  * Isolation: It looks as if each transaction ran all by itself. Basically says "we'll hide any concurrency"
  * Durability: Once a transaction commits, updates can't be lost or rolled back
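
Atomicity can be seen in a small example. This is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the `accounts` table and the `make_db`, `transfer`, and `balances` helpers are hypothetical names invented for illustration. A transfer that would overdraw the source account raises inside the transaction, so both updates are rolled back together.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two example accounts (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True on commit, False on abort."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if the block raises.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = make_db()
print(transfer(conn, "alice", "bob", 30), balances(conn))   # committed
print(transfer(conn, "alice", "bob", 500), balances(conn))  # aborted, no effect
```

The second transfer executes both `UPDATE` statements before failing, yet neither is visible afterwards, which is the "leave no effect" behaviour described under Atomicity.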

### ZooKeeper & Paxos

## Distributed Key-Value Store

## Scalable Databases

## Publish-Subscribe Queues (Kafka)

