Venti: A New Approach to Archival Storage
One-line Summary
Venti is a “building block” for constructing applications that frequently require an efficient, reliable, and accessible backup mechanism. The main technique for deduplication is to use hashing to identify different blocks.
Paper Structure Outline
Introduction
Background
The Venti Archival Server
Choice of Hash Function
Choice of Storage Technology
Applications
Vac
Physical backup
Plan 9 File system
Implementation
Performance
Reliability and Recovery
Related Work
Future Work
Conclusion
Background & Motivation
The goal of Venti, a block-level network storage system, is to provide a write-once archival repository that can be shared by multiple client machines and applications.
Design and Implementation
The interface of the system is a simple protocol that enables client applications to read and write variable-sized blocks of data. The primary technique is to use hashing on the data blocks to generate a fingerprint for identification. Some results are:
Collision resistant: By using a good hash function, it is possible to consider the hash of a data block as unique.
A block cannot be modified w/o changing its address: The behavior is intrinsically write-once. The write-once policy provides data protection, as no data will be overwritten.
Writes are idempotent: Multiple writes of the same data can be coalesced and do not require additional storage space (no extra space overhead). Duplicate blocks will be discarded and only one copy of the data will be retained.
The fingerprint allows Venti to provide inherent integrity checking of data. This allows the client to avoid errors from undetected data corruption and enables the server to identify when error recovery is necessary.
Fingerprint/Hashing
The authors want Venti to employ a cryptographic hash function. For such a function, it is computationally infeasible to find two distinct inputs that hash to the same value. Venti utilizes the SHA-1 algorithm, which generates a 20-byte hash value, making the collision probability extremely low.
Applications (that work on top of Venti)
Applications are responsible for mapping the namespace to fingerprints. One approach to record the fingerprints is to pack them into additional blocks (pointer blocks) that are also written to the server. This approach can be applied recursively until a single fingerprint is obtained. The paper didn't really go deep into this. Some other work includes sfsro (making a single-byte file system read-only using Merkle trees).
Venti Implementation
When a fingerprint comes in, a lookup on an on-disk index is performed to check where the data is stored (horrible performance due to random lookups of random fingerprints with no locality on disk). The caches are used to leverage the terrible performance, it's still not that good though (a lot of room for improvements, see the Data Domain paper for some follow-ups).
Evaluation
New Vocabulary
Idempotent: The property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application.
Links
Thanks to Brian Chang for the review notes!
Last updated