Rui's Blog

[2019 FAST] DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed...



This paper presents a new distributed caching mechanism for addressing load imbalance in large-scale storage systems.

Background & Motivation

Cloud service providers use large clusters to store data. The data access workload is skewed (power law distribution), which creates load imbalance, resulting in low throughput and long tail latencies. The objective is to achieve load balancing in distributed storage systems.
A common approach is to add a front-end cache node as a load balancer.
The problem is that nowadays, cloud-scale distributed storage spans across many clusters, which exposes scalability issues. Given that the cache throughput is 10-100 times of the server throughput, one caching node (e.g., a switch) can only guarantee load balancing for 10-100 servers (a few racks of servers within a cluster). In other words, a single cache node only guarantees intra-cluster load balancing, not inter-cluster load balancing.
Adding one cache node as the load balancer within each cluster also doesn't work: between clusters, load imbalance still exists. Adding another cache node atop all the per-cluster cache nodes does not work due to the throughput constraint.
Thus, we need a layer of distributed caching as the load balancer.

Design & Implementation

Some key design choices include:
  • Cache allocation with independent hash functions: The intuition is that if one cache node in a layer is overloaded by receiving too many queries to its cached objects, because the hash functions of the two layers are independent, the set of hot objects would be distributed to multiple cache nodes in another layer with high probability.
  • Query routing with the power-of-two-choices: The sender of a query looks at the loads of the cache nodes that cache the queried object and sends the query to the less-loaded node.
These mechanisms can be applied recursively for multi-layer hierarchical caching.