The recently revealed Meltdown and Spectre bugs are not just extraordinary issues of security, but also performance. The patches that work around Meltdown introduce the largest kernel performance regressions I’ve ever seen. Many thanks to the engineers working hard to develop workarounds to these processor bugs.
In this post I’ll look at the Linux kernel page table isolation (KPTI) patches that work around Meltdown: what overheads to expect, and ways to tune them. Much of my testing was on Linux 4.14.11 and 4.14.12 a month ago, before we deployed in production. Some older kernels have the KAISER patches for Meltdown, and so far the performance overheads look similar. These results aren’t final, since more changes are still being developed, such as for Spectre.
Note that there are potentially four layers of overhead for Meltdown/Spectre, and KPTI is just one of them; the others (cloud hypervisor changes, CPU microcode updates, and compilation changes) are summarized at the end of this post.
To understand the KPTI overhead, there are at least five factors at play. In summary:
Syscall rate: there are overheads relative to the syscall rate, although high rates are needed for this to be noticeable. At 50k syscalls/sec per CPU the overhead may be 2%, and climbs as the syscall rate increases. At my employer (Netflix), high rates are unusual in cloud, with some exceptions (eg, databases).
Context switches: these add overheads similar to the syscall rate, and I think the context switch rate can simply be added to the syscall rate for the following estimations.
Page fault rate: adds a little more overhead as well, for high rates.
Working set size (hot data): more than 10 Mbytes will cost additional overhead due to TLB flushing. This can turn a 1% overhead (from syscall cycles alone) into a 7% overhead. This overhead can be reduced by A) pcid, fully available in Linux 4.14, and B) huge pages.
Cache access pattern: the overheads are exacerbated by certain access patterns that switch from caching well to caching a little less well. Worst case, this can add an additional 10% overhead, taking (say) the 7% overhead to 17%.
To explore these I wrote a simple microbenchmark where I could vary the syscall rate and the working set size (source). I then analyzed performance during the benchmark (active benchmarking), and used other benchmarks to confirm findings. In more detail:
1. Syscall rate
This is the cost of extra CPU cycles in the syscall path. Plotting the percent performance loss vs syscall rate per CPU, for my microbenchmark:
Applications that have high syscall rates include proxies, databases, and others that do lots of tiny I/O. Also microbenchmarks, which often stress-test the system, will suffer the largest losses. Many services at Netflix are below 10k syscalls/sec per CPU, so this type of overhead is expected to be negligible for us.
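To measure the syscall rate of a running system, one low-overhead starting point is the raw_syscalls:sys_enter tracepoint with perf stat. This is a sketch, assuming perf is installed and you have root:

# count all syscall entries system-wide (-a), printing totals every second (-I 1000)
sudo perf stat -e raw_syscalls:sys_enter -a -I 1000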
This shows the system-wide syscall rate. Divide it by the CPU count (as reported by mpstat, etc) for the per-CPU rate, then by 1000 for the graph above. Note that this perf stat command causes some overhead itself, which may be noticeable for high syscall rates (>100k/sec/CPU). You can switch to ftrace/mcount to measure it with lower overhead if that is desired. For example, using my perf-tools:
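(A sketch of the invocation; the kernel symbol prefix for syscalls varies by version – sys_*, SyS_*, or __x64_sys_* – so adjust the match string to suit.)

# funccount from https://github.com/brendangregg/perf-tools (ftrace-based)
# count kernel functions matching the syscall prefix, printing a summary every second
sudo ./funccount -i 1 'SyS_*'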
Then sum the count column for the total syscall rate.
I could have taken one measurement and extrapolated most of the above graph based on a model, but it’s good to double check that there are no hidden surprises. The graph is mostly as expected, except the lower right, which shows variance and missing data points: these missing points are due to slightly negative values that are elided by the logarithmic scale. It would be easy to dismiss this all as a 0.2% error margin magnified by the logarithmic scale, but I think much of it is part of the profile, which changes at this syscall rate (between 5k and 10k syscalls/sec/CPU). This will be more obvious in the next sections, where I’ll explore it further.
My microbenchmark calls a fast syscall in a loop, along with a user-level loop to simulate an application working set size. By gradually increasing the time in the user-level loop from zero, the syscall rate decreases as more cycles are consumed in a CPU-bound user-mode thread. This gives us a spectrum from a high syscall rate of >1M/sec, down to a low syscall rate and mostly user time (>99%). I then measured this syscall range with and without the KPTI patches (by setting nopti, and also running older and newer kernels to double check).
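As an aside, whether KPTI is actually in effect can be confirmed in a few ways (a sketch; the exact interfaces depend on the kernel version and distro backports):

# newer kernels expose the mitigation state directly
cat /sys/devices/system/cpu/vulnerabilities/meltdown
# the boot log also reports it
dmesg | grep -i 'page tables isolation'
# and check whether nopti (or pti=off) was passed on the kernel command line
cat /proc/cmdline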
I collected CPU profiles as CPU flame graphs for both systems, but, for a change, they were boring: the extra cycles were just in the syscall code, as one would expect from reading the KPTI changes. To understand the overhead further I’ll need instruction-level profiling, such as perf annotate, and PMC (Performance Monitoring Counter) analysis of the CPU cycles (more on this later).
2. Context Switch & Page Fault Rate
As with syscalls, the higher these rates are, the higher the overheads. I’d add the context switch rate (cswch/s) to the syscall rate for the estimation graphs on this page (normalized per-CPU).
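The context switch rate can be read with standard tools; for example (divide the system-wide rate by the CPU count for the per-CPU rate used in these estimations):

# system-wide context switches per second (cswch/s)
sar -w 1
# or the cs column from vmstat
vmstat 1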
3. Working Set Size (hot data)
Now my microbenchmark simulates a working set size (WSS) – an area of frequently accessed memory – by reading 100 Mbytes of data in a loop and striding by the cacheline size. Performance gets much worse for lower syscall rates:
Note the overhead “jump” between 10k and 5k syscalls/sec/CPU? The characteristics of this jump are dependent on the instance type, processor, and workload, and it can appear at different points with different magnitudes: this is showing a c5.9xl with a 1.8% jump, but on an m4.2xl the jump is around 50k syscalls/sec/CPU and is much larger: 10% (shown here). I’ll analyze this in the next section. Here, we just want to look at the overall trend: much worse performance given a working set.
Will your overhead look like this graph, or the earlier one, or worse? That depends on your working set size: this graph is for 100 Mbytes, and the earlier one was for zero. See my previous post on working set size estimation and the full website. My guess is that 100 Mbytes is a large-ish case of memory needed syscall-to-syscall.
Linux 4.14 introduced full pcid support, which improves performance provided the processor also has pcid (which looks to be common in EC2). Gil Tene wrote a post explaining why PCID is now a critical performance/security feature on x86.
Update: Earlier versions of Linux had some pcid support (as pointed out by a comment on this post), and/or I’ve heard vendors like Canonical have backported pcid patches. I’ve since tested the 4.4.115 kernel and confirmed that pcid is used there. In detail, booting with “nopcid” for a given workload had 8.4% DTLB and 4.4% ITLB miss walk handling, and then booting default (with pcid) had 0.0% DTLB and 1.1% ITLB. So yes, pcid is working on 4.4 (on 4.4.115, at least).
It’s also possible to use huge pages to improve performance further (either transparent huge pages, THP, which are easy to set up but have had compaction problems in older versions; or explicit huge pages, which should perform better). The following graph summarizes the options:
For this microbenchmark, huge pages improved performance so much – despite KPTI – that it turned the performance loss into a performance gain. The logarithmic axis has elided the negative points, but here they are on a linear axis, zoomed:
Let’s say your server was doing 5k syscalls/sec/CPU, and you suspect you have a large working set, similar to this 100 Mbyte test. On recent LTS Linux (4.4 or 4.9) with KPTI (or KAISER) patches, the performance overhead would be about 2.1%. Linux 4.14 has pcid support, so that overhead becomes about 0.5%. With huge pages, the overhead becomes a gain of 3.0%.
A quick look at TLB PMCs explains much of this overhead: here I’m using tlbstat, a quick tool I hacked up in my pmc-cloud-tools repository. The following results look at a single point in the above graph (on a different system with full PMCs) where the worst case overhead from no KPTI to KPTI without pcid was 8.7%.
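tlbstat is run much like vmstat, with an interval argument (a sketch; options and column details may differ between versions of pmc-cloud-tools):

# from https://github.com/brendangregg/pmc-cloud-tools
# prints TLB walk counts and the percentage of cycles spent servicing
# DTLB/ITLB walks, once per interval
sudo ./tlbstat 1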
A large portion of CPU cycles had page walks active. I’ve never seen TLB pain this bad, ever. Just the IPC (instructions per cycle) alone tells you something bad is happening: dropping from 0.86 to 0.10, which mirrors the performance loss. I still recommend including IPC along with (not instead of) any %CPU measurement, so that you really know what the cycles are, as I discussed in CPU Utilization is Wrong.
4. Cache Access Pattern
Depending on the memory access pattern and working set size, an additional 1% to 10% overhead can occur at a certain syscall rate. This was seen as the jump in the earlier graph. Based on PMC analysis and the description of the changes, one suspected factor is additional page table memory demand on the KPTI system, causing the workload to fall out of a CPU cache sooner. Here are the relevant metrics:
This shows that the performance loss jump corresponds to a similar drop in last-level cache (LLC) references (first two graphs): that is not actually interesting, as the lower reference rate is expected from a lower workload throughput. What is interesting is an abrupt drop in LLC hit ratio, from about 55% to 50%, which does not happen without the KPTI patches (the last nopti graph, which shows a small LLC hit ratio improvement). This sounds like the extra KPTI page table references have pushed the working set out of a CPU cache, causing an abrupt drop in performance.
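For reference, LLC references and misses can be sampled with generic perf counters, along these lines (a sketch; event availability varies by processor and virtualization):

# system-wide LLC loads and misses, printed every second;
# hit ratio = 1 - (LLC-load-misses / LLC-loads)
sudo perf stat -e LLC-loads,LLC-load-misses -a -I 1000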
I don’t think a 55% to 50% drop in LLC hit ratio can fully explain a 10% performance loss alone. Another factor is at play that would require additional PMCs to analyze; however, this target is an m4.16xl where PMCs are restricted to the architectural set. I described this in The PMCs of EC2: the architectural set is better than nothing, and a year ago we had nothing. Late last year, EC2 gained two more options for PMC analysis: the Nitro hypervisor, which provides all PMCs (more or less), and the bare metal instance type. My earlier TLB analysis was on a c5.9xl, a Nitro hypervisor system. Unfortunately, the “jump” on that system is only about 1%, making it harder to spot outside the normal variance.
In short, there are some additional memory overheads with KPTI that can cause workloads to drop out of CPU cache a little sooner.
Sanity Tests
Putting these graphs to the test: a MySQL OLTP benchmark ran with 75k syscalls/sec/CPU (600k syscalls/sec across 8 CPUs), and is expected to have a large working set (so it more closely resembles the 100 Mbyte working set test than the 0 Mbyte one). The graph estimates the performance loss of KPTI (without huge pages or pcid) to be about 4%. The measured performance loss was 5%. On a 64-CPU system, with the same workload and same per-CPU syscall rate, the measured performance loss was 3%. Other tests were similarly close, as was the production rollout.
The test that least matched the previous graphs was an application with a stress test driving 210k syscalls/sec/CPU + 27k context-switches/sec/CPU, and a small working set (~25 Mbytes). The estimated performance loss should have been less than 12%, but it was 25%. To check that there wasn’t an additional overhead source, I analyzed this and saw that the 25% was due to TLB page walk cycles, which are the same overheads studied earlier. I’d guess the large discrepancy was because that application workload was more sensitive to TLB flushing than my simple microbenchmark.
Tuning
1. Linux 4.14 & pcid
I mentioned this earlier, and showed the difference in the graph. It definitely works on 4.14, if you can run that, with CPUs that support pcid (check /proc/cpuinfo). It might work on some older kernels (like 4.4.115).
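For example, to check whether the processor advertises pcid (a sketch; the flag names are those shown in /proc/cpuinfo):

# does the processor advertise pcid (and the related invpcid)?
grep -wo 'pcid\|invpcid' /proc/cpuinfo | sort | uniq -c
# to compare, the kernel can be booted with "nopcid", as in the 4.4.115 test above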
2. Huge Pages
I mentioned this earlier too. I won’t summarize how to configure huge pages here with all their caveats, since that’s a huge topic.
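As a quick check only (not configuration guidance), you can see whether transparent huge pages are enabled and actually being used:

# current THP policy: always, madvise, or never (the active one is in brackets)
cat /sys/kernel/mm/transparent_hugepage/enabled
# how much anonymous memory is currently backed by huge pages
grep AnonHugePages /proc/meminfo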
3. Syscall Reductions
If you were at the more painful end of the performance loss spectrum due to a high syscall rate, then an obvious move is to analyze what those syscalls are and look for ways to eliminate some. This used to be routine for systems performance analysis many years ago, but more recently the focus has been on user-mode wins.
There are many ways to analyze syscalls. Here are some, ordered from most to least overhead:
The fastest is ftrace/mcount, and I already had an example earlier from my perf-tools repo for counting syscalls. Summarizing 10 seconds:
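(A sketch of the invocation; as before, the syscall symbol prefix depends on the kernel version.)

# count syscall functions for 10 seconds, then print a summary
sudo ./funccount -d 10 'SyS_*'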
Ending tracing…
Here’s the same using my bcc/eBPF version of funccount and the syscall tracepoints, which was only possible since Linux 4.14 thanks to Yonghong Song’s support for this:
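(A sketch; this funccount is the bcc version, and the pattern matches the syscall tracepoints.)

# count syscall tracepoint events until Ctrl-C
sudo funccount 't:syscalls:sys_enter*'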
Tracing 310 functions for “t:syscalls:sys_enter*”… Hit Ctrl-C to end.
Now that you know the most frequent syscalls, look for ways to reduce them. You can use other tools to inspect their arguments and stack traces (eg: using perf record, or kprobe in perf-tools, trace in bcc, etc), and look for optimization opportunities. This is performance engineering 101.
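For example, to see where a particular hot syscall is being called from (a hypothetical example using the futex syscall – substitute whichever syscall dominates your counts):

# sample stacks for one syscall tracepoint, system-wide, for 10 seconds, then report
sudo perf record -e syscalls:sys_enter_futex -a -g -- sleep 10
sudo perf report -n --stdio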
Conclusion and Further Reading
The KPTI patches to mitigate Meltdown can incur massive overhead, anything from 1% to over 800%. Where you are on that spectrum depends on your syscall and page fault rates, due to the extra CPU cycle overheads, and your memory working set size, due to TLB flushing on syscalls and context switches. I described these in this post and analyzed them for a microbenchmark. Of course, nothing beats testing with real workloads, and the analysis I’ve included here may be more useful for explaining why performance regressed and showing opportunities to tune it than for estimating the regressions.
Practically, I’m expecting the cloud systems at my employer (Netflix) to experience between 0.1% and 6% overhead with KPTI due to our syscall rates, and I’m expecting we’ll take that down to less than 2% with tuning: using 4.14 with pcid support, huge pages (which can also provide some gains), syscall reductions, and anything else we find.
This is only one out of four potential sources of overhead from Meltdown/Spectre: there are also cloud hypervisor changes, Intel microcode updates, and compilation changes. These KPTI numbers are also not final, as Linux is still being developed and improved.