The Linux Scheduler: a Decade of Wasted Cores
Last updated
Linus gets roasted.
This paper pinpoints performance bugs in the Linux scheduler (especially on multi-core systems) and proposes fixes; along the way, the authors developed tools for detecting and understanding these bugs.
Introduction
The Linux Scheduler
On a single-CPU system, CFS is very simple
On multi-core systems, CFS becomes quite complex
The load balancing algorithm
Optimizations
Bugs
The Group Imbalance bug
The Scheduling Group Construction bug
The Overload-on-Wakeup bug
The Missing Scheduling Domains bug
Discussion
Tools
Online Sanity Checker
Scheduler Visualization tool
Lessons Learned
Related Work
Conclusion
The Linux kernel's process scheduler underwent three main periods of evolutions:
v0.01~v2.4.x: the very first scheduler
v2.6.0~v2.6.22: O(1) scheduler
v2.6.23~: Completely Fair Scheduler (CFS)
There were also alternatives like the Brain Fuck Scheduler (BFS), which performs better than CFS on desktop Linux systems with <16 cores but does not scale to large systems (4096 processors / NUMA); due to this and other drawbacks, it was never merged into the mainline Linux kernel.
Modern Linux uses the Completely Fair Scheduler (CFS), which implements the Weighted Fair Queueing (WFQ) algorithm. On a single-CPU system, CFS is really simple -- it time-slices among running threads to achieve fair sharing. On multi-core systems, however, things get a bit messy -- to address scalability and keep context switches fast, per-core runqueues are used, and for the scheduling algorithm to work correctly and efficiently, the runqueues must be kept balanced. The optimizations made to the load-balancing algorithm are complex and lead to bugs.
When a core attempts to steal work from another node (in other words, from another scheduling group), it does not examine the load of every core in that group; it only looks at the group's average load. If the average load of the victim scheduling group is greater than its own, it attempts to steal from that group; otherwise it does not. This is exactly why the underloaded cores fail to steal from the overloaded cores on other nodes: they observe that the average load of the victim node's scheduling group is no greater than their own. The core trying to steal work runs on the same node as the high-load R thread; that thread inflates the average load for the node and conceals the fact that some cores are actually idle. Meanwhile, cores on the victim node, with roughly the same average load, have lots of waiting threads.
The fix is simple: When the algorithm compares the load of scheduling groups, it should be comparing the minimum loads instead of the average loads.
If the minimum load of one scheduling group is lower than the minimum load of another scheduling group, it means that the first scheduling group has a core that is less loaded than all cores in the other group, and thus a core in the first group must steal from the second group. This algorithm ensures that no core of the second group will remain overloaded while a core of the first group has a smaller load, thus balancing the load across cores.
This bug requires specific hardware topology to trigger (2 nodes that are two hops apart).
The root cause of this bug is that scheduling group construction is not adapted to modern NUMA machines. In the above example, the first two scheduling groups look like this:
{0, 1, 2, 4, 6}, {1, 2, 3, 4, 5, 7}
Notice how nodes 1 and 2 are included in both scheduling groups. The fix is to modify the construction of scheduling groups so that each core uses scheduling groups constructed from its own perspective.
When a thread goes to sleep on node X and the thread that later wakes it up is running on that same node, the scheduler only considers the cores of node X for scheduling the awakened thread. If all cores of node X are busy, the thread misses opportunities to use idle cores on other nodes. The fix is to alter the code that is executed when a thread wakes up.
We wake up the thread on the local core – i.e. the core where the thread was scheduled last – if it is idle; otherwise if there are idle cores in the system, we wake up the thread on the core that has been idle for the longest amount of time. If there are no idle cores, we fall back to the original algorithm to find the core where the thread will wake up.
When a core is disabled and then re-enabled using the /proc interface, load balancing between any NUMA nodes is no longer performed.
During code refactoring, the Linux developers dropped the call to the function that regenerates domains across NUMA nodes. The fix is to simply add it back.
The authors developed two tools that help with understanding the bugs:
Online sanity checker: It verifies that no core is idle while another core's runqueue has waiting threads. Such a condition is fine if it is short-lived, but an alert is raised if it persists.
Scheduler visualizer: It shows the scheduling activity over time. Some of the graphs in the paper (e.g., figure 3) were produced using this tool.
Scheduling is complicated. As hardware evolves rapidly, more optimizations to the scheduler will keep being proposed. With optimizations comes complexity, and the scheduler should be designed so that modularized optimizations can be integrated with ease.
Visualization is important. Detecting the aforementioned bugs with conventional tools is tricky.
LKML: The Linux Kernel Mailing List