Data Domain: Avoiding the Disk Bottleneck in the Data Domain Deduplication File System
One-line Summary
The Data Domain Deduplication File System introduces techniques that relieve the disk bottleneck that limited earlier deduplication systems. Combined, these techniques can remove about 99% of the disk accesses required for deduplication.
Paper Structure Outline
Introduction
Challenges and Observations
Variable vs. Fixed Length Segments
Segment Size
Performance-Capacity Balance
Fingerprint vs. Byte Comparisons
Deduplication Storage System Architecture
Content Store
Segment Store
Container Manager
Acceleration Methods
Summary Vector
Stream-Informed Segment Layout
Locality Preserved Caching
Accelerated Segment Filtering
Experimental Results
Results with Real World Data
I/O Savings with Summary Vector and Locality Preserved Caching
Throughput
Discussion
Related Work
Conclusions
Background & Motivation
Deduplication throughput in Venti was poor because every segment fingerprint lookup required a random disk access to the on-disk index. This work aims to eliminate most of those disk accesses.
Design and Implementation
The three main performance enhancement techniques are:
Summary vectors
Stream-informed layout
Locality-preserved caching
The Data Domain File System also features a layered file system architecture.
Layered file system
Content Store: Manages data in files; breaks the data into segments; computes the segment fingerprints (SHA-1)
Segment Store: Maps segment descriptors (fingerprints) to segment data; performs the deduplication; keeps track of references and updates the segment index; compresses segments and pushes them to the container layer
Container Manager: Provides the container abstraction; writes fixed-size containers in their entirety
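A minimal sketch of how the three layers might be wired together, under simplified assumptions (the class and method names are hypothetical, and fixed-size chunking stands in for the system's content-defined, variable-length segmentation):

```python
import hashlib
import zlib

class ContainerManager:
    """Provides the container abstraction; containers are written whole."""

    def __init__(self):
        self.log = []  # stand-in for the on-disk container log

    def append(self, fingerprint, compressed):
        # A real implementation packs segments into an open container and
        # writes the container to disk only when it is full.
        self.log.append((fingerprint, compressed))
        return len(self.log) - 1  # acts as a segment locator

class SegmentStore:
    """Maps fingerprints to data; deduplicates; compresses segments."""

    def __init__(self, container_manager):
        self.cm = container_manager
        self.index = {}  # fingerprint -> locator (on disk in the real system)

    def write(self, fingerprint, segment):
        if fingerprint in self.index:
            return  # duplicate: only a reference is recorded
        self.index[fingerprint] = self.cm.append(
            fingerprint, zlib.compress(segment))

class ContentStore:
    """Manages data in files: segments the byte stream and fingerprints it."""

    def __init__(self, segment_store, segment_size=8 * 1024):
        self.segment_store = segment_store
        self.segment_size = segment_size

    def write_file(self, data: bytes):
        for i in range(0, len(data), self.segment_size):
            segment = data[i:i + self.segment_size]
            fingerprint = hashlib.sha1(segment).digest()
            self.segment_store.write(fingerprint, segment)
```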
Summary vector (Bloom filter)
The summary vector is an in-memory Bloom filter over the fingerprints of all stored segments. It avoids going to disk for an index lookup when a segment is new: if the filter says a fingerprint is absent, the segment definitely has not been stored yet, so the system can skip the lookup and simply write the segment. Bloom filters may produce false positives but never false negatives (i.e. the summary vector will never tell you that an existing fingerprint doesn't exist), so a false positive costs only an unnecessary index lookup, never a correctness error.
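A minimal sketch of a summary vector as a Bloom filter, assuming SHA-1 fingerprints; the bit-array size, hash count, and salting scheme here are illustrative, not the paper's parameters:

```python
import hashlib

class SummaryVector:
    """In-memory Bloom filter over segment fingerprints (illustrative)."""

    def __init__(self, num_bits=2**23, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions by re-hashing the fingerprint with a salt.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fingerprint: bytes):
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, fingerprint: bytes) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))
```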
Stream-informed layout
The motivation is that fingerprints are essentially randomly distributed, so they have no useful locality in fingerprint space; however, the same group of segments is likely to reappear together, in roughly the same order, in future backup streams. The idea is to dedicate containers to holding segments and their fingerprints in logical (stream) order. This preserves stream locality on disk and lets the container manager issue fewer I/Os.
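A rough sketch of stream-informed packing, assuming one open container per stream and a fixed container capacity (Container, StreamInformedWriter, and the container manager's write method are hypothetical names):

```python
class Container:
    CAPACITY = 4 * 1024 * 1024  # illustrative capacity, not the paper's

    def __init__(self):
        self.metadata = []  # fingerprints, kept in logical (stream) order
        self.data = []      # segment data, in the same order
        self.size = 0

class StreamInformedWriter:
    def __init__(self, container_manager):
        self.cm = container_manager
        self.open = {}  # stream id -> its currently open container

    def append(self, stream_id, fingerprint, segment):
        c = self.open.setdefault(stream_id, Container())
        c.metadata.append(fingerprint)
        c.data.append(segment)
        c.size += len(segment)
        if c.size >= Container.CAPACITY:
            self.cm.write(c)  # the full container is written in one I/O
            self.open[stream_id] = Container()
```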
Locality-preserved caching
An extension of the idea that motivates the stream-informed layout. The cache is divided into container-sized units, and caching is done at the granularity of whole containers: if one segment in a container is needed, the other segments in that container are likely to be needed soon as well.
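A sketch of container-granularity caching with LRU eviction (the class and method names are hypothetical):

```python
from collections import OrderedDict

class LocalityPreservedCache:
    """Caches fingerprints in container-sized units, evicting whole
    containers in LRU order."""

    def __init__(self, max_containers=1024):
        self.max_containers = max_containers
        self.containers = OrderedDict()  # container id -> fingerprint set
        self.fp_to_container = {}

    def lookup(self, fingerprint):
        cid = self.fp_to_container.get(fingerprint)
        if cid is None:
            return False
        self.containers.move_to_end(cid)  # mark most recently used
        return True

    def insert_container(self, container_id, fingerprints):
        # On an index hit, pull in every fingerprint from that container:
        # if one segment recurs, its neighbors likely recur soon after.
        if len(self.containers) >= self.max_containers:
            old_cid, old_fps = self.containers.popitem(last=False)
            for fp in old_fps:
                if self.fp_to_container.get(fp) == old_cid:
                    del self.fp_to_container[fp]
        self.containers[container_id] = set(fingerprints)
        for fp in fingerprints:
            self.fp_to_container[fp] = container_id
```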
Procedure for a segment write
1. Check if the fingerprint is in the segment cache. If yes, the segment is a duplicate and we are done. Else, continue to step 2.
2. Check the summary vector to see if the segment may already be stored.
   2.1. If not (definitely new data), append the segment to the current container (for locality with the other data in that container). Once the container is full, it is sealed and passed to the container manager.
   2.2. If yes, it might be a false positive, so check whether the segment really exists.
3. Look up the fingerprint in the on-disk segment index. If it is really there (a true duplicate), insert it into the segment cache along with all the other fingerprints from its container.
4. Otherwise (a summary-vector false positive), go to step 2.1 and append the segment to the current container as new data.
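Putting the steps together, a hedged sketch of the write path in Python, reusing the sketch classes above; the on-disk index interface (index.lookup, index.container_fingerprints) is assumed for illustration:

```python
def write_segment(fingerprint, segment, stream_id,
                  cache, summary, index, writer):
    """One segment write, following the four steps above (illustrative)."""
    # Step 1: a segment-cache hit is a confirmed duplicate.
    if cache.lookup(fingerprint):
        return "duplicate"

    # Step 2: if the summary vector says "definitely new",
    # skip the index lookup entirely and store the segment.
    if not summary.may_contain(fingerprint):
        summary.add(fingerprint)
        writer.append(stream_id, fingerprint, segment)
        return "new"

    # Step 3: possible false positive -- consult the on-disk index.
    container_id = index.lookup(fingerprint)  # hypothetical interface
    if container_id is not None:
        # True duplicate: prefetch the whole container's fingerprints
        # into the cache (locality-preserved caching).
        cache.insert_container(
            container_id, index.container_fingerprints(container_id))
        return "duplicate"

    # Step 4: summary-vector false positive -- treat as new data (step 2.1).
    summary.add(fingerprint)
    writer.append(stream_id, fingerprint, segment)
    return "new"
```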
Evaluation
New Vocabulary
Bloom filter: A space-efficient probabilistic data structure that is used to test whether an element is a member of a set.