This year I have been involved in running performance benchmarks of Aeron over at Adaptive on two major cloud providers. I learned quite a few things about the arcane craft of running performance benchmarks.

When benchmarking a piece of software, you really want to get the best performance out of it, which is to say that you also want to run it under the best possible conditions in order to see what is achievable. One aspect I’d like to cover in this article is giving the program exclusive access to the CPU core(s) it needs, without other threads interfering with its execution. This involves:

  • pinning a particular process to a particular CPU core
  • isolating one or more CPU cores

CPU pinning ensures that the scheduler of the operating system will always execute a process on the designated CPU core. This is also known as processor affinity.

CPU core isolation ensures that the kernel’s process scheduler will not schedule work on those CPU cores. Normally, the process scheduler may move processes around on CPU cores in order to help provide equal running time to all processes actively executing.
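
As a quick illustration of processor affinity, you can query the affinity of any running process (PID 1 here, purely as an example); on an untouched system it typically spans all logical CPUs:

# Show which cores PID 1 is currently allowed to run on.
taskset -cp 1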

Why do we care?

The main reason for wanting to keep a process running on the same CPU core is to avoid the cost of a thread context switch, which is quite an expensive operation, as shown on the last line of the graph below (source):

Latency costs of common CPU operations

By default, the operating system’s scheduler will use different policies to decide how to assign CPU cores to threads. On Linux, you can see which scheduling policy is applied to a process by running ps -eLfc and checking the CLS column (which stands for scheduling class):

UID        PID  PPID   LWP NLWP CLS PRI STIME TTY          TIME CMD
root        56     2    56    1 FF   90 Sep13 ?        00:00:00 [watchdogd]
...
root       671     1   671    2 TS   19 Sep13 ?        00:07:47 /usr/sbin/irqbalance --foreground
...
root       429     1   429    7 RR  139 Sep13 ?        00:00:52 /sbin/multipathd -d -s

In the sample above, there are 3 different scheduling policies: FF (FIFO), TS (time-sharing) and RR (round-robin). There are a few more in practice, time-sharing being the one that’s used most frequently.
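
If you want to check (or change) the policy of a single process, chrt gives a more detailed view; a quick example, reusing the irqbalance PID from the listing above:

# Show the scheduling policy and priority of a process (PID 671 from the ps output above).
chrt -p 671
# pid 671's current scheduling policy: SCHED_OTHER
# pid 671's current scheduling priority: 0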

What this means in practice for a thread is that its execution may be paused for some time while it is moved to a different core. This operation takes quite a few CPU cycles as it leads to cache misses for at least the L1 and L2 caches (which are local to a core). And we don’t want that.
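
If you want to see how often the scheduler actually moves a thread around, perf can count context switches and CPU migrations for you; a small sketch (the workload and the 5-second duration are arbitrary):

# Count context switches and core migrations while a short CPU-bound workload runs.
# stress -c 1 spins one CPU worker, -t 5 stops it after 5 seconds.
perf stat -e context-switches,cpu-migrations stress -c 1 -t 5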

Deciding which cores to use

One of the most critical aspects to get right when it comes to pinning and isolating cores is the range of cores to use. To get this right, there are three rules to follow.

Rule 1: never, ever use core 0

Core 0 is the go-to core for the operating system. It’s where the kernel will run core processes (pun not intended) and no matter what you tell it, it will keep doing so. Don’t run your mission-critical, low-latency process on core zero.

Rule 2: avoid core 1

This one was pretty new to me until I read this article on the fear of commitment to the first CPU core. And granted, it may sound a bit paranoid to base all future life-and-death core-picking decisions on the single incident described in that article. That being said, most modern systems have a lot of cores, so skipping core 1 to be on the safe side may not hurt.

Rule 3: know the CPU layout

I mean, this is so important that it might as well be rule 1. That being said, if you use core 0 you are guaranteed to get abysmal results, whereas depending on how badly you screw up rule 3 you may still see okay results.

Consider the lstopo output of an Intel Xeon Platinum 8375C CPU @ 2.90GHz. This CPU has 32 cores and, since hyperthreading is enabled, 64 hyperthreads (reported as Processing Units):

Hardware topology of an Intel Xeon Platinum 8375C CPU

The physical Core L#0 has two Processing Units (PU#0 and PU#1), which are the hyperthreads P#0 and P#32 respectively. It’s important to understand that whilst the operating system sees 64 (logical) cores, if we want to work at the level of a physical core we have to consider both of its hyperthreads.

For example, if your process runs best on one (physical) CPU core, then in this case you’ll want to isolate both of its hyperthreads. Following rule 1 and rule 2, if we wanted to use core L#2 (not shown on the output of lstopo here, but running lscpu -a -e would give the full detail), then we’d need to isolate the logical cores 2 and 34.
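
The hyperthread pairs can also be read directly from sysfs, which is handy when scripting this; for core L#2 on this machine the file should contain 2,34:

# List the logical CPUs (hyperthreads) sharing the same physical core as CPU 2.
cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list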

Now that we’ve talked about which cores to work with, let’s talk about how to actually pin and isolate them.

Pinning threads to cores with taskset

taskset allows you to pin a process to a specific core. There are several ways of using the command. You can specify the core upfront:

taskset -c 2 stress -c 1

This will run the stress -c 1 command (which generates load on a system, in this case on one CPU core) on the third core (it’s a zero-based index). taskset -c 2 is shorthand for taskset --cpu-list 2, which lets you specify one or more cores rather than having to specify CPU masks, which is rather cumbersome.
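
To double-check that the pinning took effect, you can query the affinity of the running process (pgrep is just used here to grab the PIDs):

# Print the affinity list of every stress process; each should report core 2 only.
for pid in $(pgrep stress); do taskset -cp "$pid"; done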

It is also possible to move a running process to a specific core:

taskset -p -c 2 <pid>

The latter approach can be seen in action below. Notice in the display of top in the upper right of the screen how the process is started on core 0 and how the CPU usage of the different cores changes as we move it around:

Using taskset to pin threads to a core

Now that we’ve seen how to pin a process to a core, let’s see what we need to do to ensure that said core is left alone by the operating system scheduler.

Isolating cores

In what follows, we’ll assume that we want to free up 4 physical cores for our processes. On the CPU given as example above, this means that we’ll need to isolate the logical core pairs 2/34, 3/35, 4/36 and 5/37. Expressed as a CPU list, we want to isolate the ranges 2-5,34-37.

Getting rid of kernel noise

The Linux kernel is an amazing and complex piece of software that allows us to use the underlying hardware without having to worry too much about what is really happening. What this means is that there’s a lot of maintenance and cleanup work going on constantly in order to keep things running smoothly. Some of that work is shared between CPU cores and some of it needs to run on each core. From the perspective of someone wanting to run low-latency, jitter-free processes, this work is noise. If you want to learn more about the details of what exactly is going on, I invite you to read this article series on CPU isolation which really goes into the detail of things.

In order to get rid of the noise, we need to set the nohz_full kernel boot parameter:

nohz_full=2-5,34-37

What this attempts to do is keep housekeeping work away from the specified CPU list, for example by:

  • stopping the timer tick, an interrupt which usually runs at 100-1000 Hz on each core and performs many small tasks (in practice the tick isn’t stopped entirely, there’s a residual 1Hz tick that remains)
  • relocating Read-Copy-Update (RCU) callbacks to other cores
  • moving unbound kernel workqueues and kthreads to other cores
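
For nohz_full to be available at all, the kernel needs to be built with full dynticks support; a quick way to check (the config file location may vary across distributions):

# CONFIG_NO_HZ_FULL=y must be present for nohz_full to have any effect.
grep NO_HZ_FULL /boot/config-$(uname -r)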

Note that in order for nohz_full to be able to do its job, the clocksource of the system needs to be a (reliable) TSC (Time Stamp Counter). The TSC is a clock implemented within the processor on x86 architectures. You can set the clocksource to tsc using:

echo "tsc" | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource

Isolating the cores

We need to make sure that the kernel scheduler won’t schedule any work on the CPUs we want to dedicate to our processes. The preferred way to do this is to use cpusets but there are other techniques as well. Since the context of this article is to create a dedicated environment for benchmarking we’re going to use the isolcpus boot parameter. It is much more rigid than cpusets but for this use-case it will do just fine.
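
For the record, this is roughly what the cpuset route could look like using the cset front-end (a sketch, assuming the cpuset package is installed; we won’t be using it in the rest of this article):

# Move everything (including movable kernel threads) off cores 2-5,34-37
# and reserve them for processes we explicitly place in the shield.
sudo cset shield --cpu 2-5,34-37 --kthread=on
# Run a command inside the shielded set.
sudo cset shield --exec stress -- -c 1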

At the core (pun intended), we just need to pass our CPU list via isolcpus like so:

nohz_full=2-5,34-37 isolcpus=2-5,34-37

Checking out the kernel documentation though, it is possible to go even further by:

  • using the domain flag to isolate the cores from SMP balancing and scheduling algorithms
  • using the managed_irq flag to prevent the cores from being targeted by managed interrupts

So our boot parameters become:

nohz_full=2-5,34-37 isolcpus=domain,managed_irq,2-5,34-37

Getting rid of hardware IRQs

The affinity of interrupts managed by the kernel has already been changed via the managed_irq flag of isolcpus, but we still have to take care of the other interrupts.

This is possible by setting the affinity of each interrupt.

There are several ways of achieving this:

  • directly setting the list of allowed cores for each interrupt via /proc/irq/IRQ#/smp_affinity_list (see the sketch after this list)
  • using irqbalance, which takes the mask of isolated and adaptive-ticks CPUs into account when assigning interrupts to cores
  • setting the default IRQ affinity mask using the irqaffinity boot parameter
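
As an illustration of the first approach, this is what moving a single interrupt to the housekeeping cores would look like (IRQ 125 is a made-up number, check /proc/interrupts for the real ones on your system):

# Restrict IRQ 125 (hypothetical) to the housekeeping cores; note that some
# interrupts (per-CPU ones for example) cannot be moved and will report an error.
echo "0-1,6-33,38-63" | sudo tee /proc/irq/125/smp_affinity_list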

Since we’re already heavily relying on kernel boot parameters, let’s use that approach. Our command line becomes:

nohz_full=2-5,34-37 isolcpus=domain,managed_irq,2-5,34-37 irqaffinity=0-1,6-33,38-63

Note that it is possible to check the resulting affinities of all IRQs at once (whether they were set by irqbalance or by the boot parameter) by running:

find /proc/irq/ -name smp_affinity -print -exec cat {} \; | less

After applying the above irqaffinity, this yields the mask ffffffc3,ffffffc3 (for interrupts that support ranges), which is to say everything but cores 2-5 and 34-37.
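
If decoding hexadecimal affinity masks by eye isn’t your thing, here is a small sketch that prints the CPUs a mask excludes (the mask value being the one from above):

# Decode an smp_affinity mask: print the CPUs whose bit is *not* set.
# The mask is made of 32-bit words, most significant word first, so we
# reverse the words to start counting CPUs from the rightmost one.
mask="ffffffc3,ffffffc3"
base=0
for word in $(echo "$mask" | tr ',' '\n' | tac); do
  for bit in $(seq 0 31); do
    (( (0x$word >> bit) & 1 )) || echo "CPU $((base + bit)) excluded"
  done
  base=$((base + 32))
done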

Testing that it all works

Now that everything is in place, let’s see if it works. We’ll test this both on a physical machine and on an AWS c6i.4xlarge instance.

The first thing to do is to verify that CPU isolation is in effect. This can be achieved by looking at /sys/devices/system/cpu/isolated and checking that it contains the configured CPU list. If the file is empty, make sure you have applied the boot parameters correctly and that the specified ranges are correct.
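
Both the isolated and the adaptive-ticks CPU lists can be read back from sysfs after boot:

# Both files should contain the CPU list passed via the boot parameters.
cat /sys/devices/system/cpu/isolated
cat /sys/devices/system/cpu/nohz_full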

Measuring interruptions on an Intel Core i7-7700K CPU @ 4.20GHz

I have an old desktop machine sitting next to me on which I sometimes run experiments. It has 4 cores (2 threads per core, so 8 hyperthreads) and the following topology:

Hardware topology of my Intel Core i7-7700K CPU

Let’s isolate the last two cores. As per the topology, this means we’ll use the following kernel boot parameters:

nohz_full=2,6,3,7 isolcpus=domain,managed_irq,2,6,3,7 irqaffinity=0,4,1,5

After applying the boot parameters and restarting the machine, we validate that the isolation settings are in effect:

manu@andromeda:~$ cat /sys/devices/system/cpu/isolated
2-3,6-7

We’ll be using Erik Rigtorp’s hiccups, which measures jitter introduced by the system. The way it works is described in the documentation:

It runs a thread on each processor core that loops for a fixed interval (by default 5 seconds). Each loop iteration acquires a timestamp and if the difference between the last two successive timestamps exceeds a threshold, the thread is assumed to have been interrupted and the difference is recorded. At the end of the run the number of interruptions, 99th percentile, 99.9th percentile and maximum interruption per processor core is output.

Running hiccups for a minute on the non-isolated cores yields the following:

manu@andromeda:~$ taskset -c 0,1,4,5 hiccups -r 60 | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns  max_ns
  0           256    15209     10045      33451   86355
  1           256    15124      7649      14785   59020
  4           256    15087     31269      42411   71444
  5           256    15283      8184      24662  371776

Whereas on the isolated ones we get the following interruption measurements:

manu@andromeda:~$ taskset -c 2,3,6,7 hiccups -r 60 | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns  max_ns
  2           256    15059      2714       4285    5178
  3           256    15061      2819       4218    5894
  6           256    15064      2828       4092    5585
  7           256    15060      2815       4037    5430

Whilst the number of hiccups returned by the tool is pretty similar, the p99, p999 and maximum interruption times tell another story entirely.

Measuring interruptions on an AWS c6i.4xlarge instance

The AWS c6i.4xlarge instance type sports an Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz. When running an instance with hyper-threading disabled (one thread per core), we get the following topology:

Hardware topology of an AWS c6i.4xlarge instance

Configuring core isolation on this system is arguably simpler than for a system with hyperthreading enabled. When isolating the last 4 cores we use:

nohz_full=4-7 isolcpus=domain,managed_irq,4-7 irqaffinity=0-3

However, when the machine is idle, core isolation doesn’t seem to have much of an effect on the hiccups measurements:

ubuntu@ip-172-31-41-250:~$ taskset -c 0-7 hiccups -r 60 | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns   max_ns
  0           152    75602     12355      54979   697143
  1           152   130606     11276      22738  1918870
  2           152   130984     11043      12805    88752
  3           152   132947     11083      13191    87123
  4           152   131913     10941      11671    16807
  5           152   131874     10960      11699    49622
  6           152   130537     10957      11729    16857
  7           152   131892     10966      11753    19599

The effect of core isolation only becomes visible when the machine is under some kind of load. For example, when running stress -c 1 (which keeps one CPU core busy), the hiccups measurement returns:

ubuntu@ip-172-31-41-250:~$ taskset -c 0-7 hiccups -r 60 | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns    max_ns
  0           152    63424     52761   16014545  24013667
  1           152    69483     16723   16013715  16087101
  2           152    68799     13613   16013628  16023966
  3           152    68412     14756   16013621  16028852
  4           152    78487     11363      12156     16911
  5           152    79553     11349      12104     49461
  6           152    78452     11326      12053     16462
  7           152    79161     11375      12157     19405

Again, the number of interruptions returned by the tool isn’t very telling but the difference in interruption times (at p99, p999 and max) is.

I’ve been able to reproduce this behavior using other tools and configurations. For example, by using Georg Sauthoff’s osjitter on an AWS c6i.4xlarge with hyperthreading enabled and isolating cores 2-6, we get:

  • for an idle instance:
ubuntu@ip-172-31-47-43:~/osjitter$ ./osjitter -t 60
 CPU  TSC_khz  #intr  #delta  ovfl_ns  invol_ctx  sum_intr_ns  iratio  rt_s  loop_ns  median_ns  p20_ns  p80_ns  p90_ns  p99_ns  p99.9_ns   max_ns  mad_ns
   0  2899950  38684   38684        0        273    251396375   0.004    60       16       6715     263   10053   10984   14437     26997    82546    3017
   1  2899950  38457   38457        0        206    251472522   0.004    60       16       6779     274   10038   10897   13486     26818   165892    2988
   2  2899950  38365   38365        0         21    243407125   0.004    60       15       6695     270    9885   10723   12194     13652    23142    2962
   3  2899950  38106   38106        0         21    242990329   0.004    60       15       6695     283    9877   10726   12200     13593    20418    2932
   4  2899950  38528   38528        0         21    246996192   0.004    60       16       6841     267    9837   10697   12176     13594    21141    2768
   5  2899950  39370   39370        0         21    262076438   0.004    60       15       6918     287   10059   10889   13796     26355   170462    2824
   6  2899950  37990   37990        0         21    246605341   0.004    60       15       6858     284    9889   10732   12173     13526    21042    2773
   7  2899950  38566   38566        0        227    254834470   0.004    60       16       6903     264   10011   10898   13577     25072    94380    2824
   8  2899950  38668   38668        0        245    259893176   0.004    60       16       7121     261   10046   10928   13666     23017   127763    2645
   9  2899950  38434   38434        0        154    256422016   0.004    60       16       7050     273   10034   10860   12812     21942   298864    2628
  10  2899950  38355   38355        0         21    253126395   0.004    60       15       7055     271    9981   10785   12210     13428    21498    2693
  11  2899950  38109   38109        0         21    252437715   0.004    60       15       7023     283    9989   10800   12218     13387    20330    2722
  12  2899950  38529   38529        0         21    249820114   0.004    60       16       6935     267    9967   10783   12222     13406    20620    2809
  13  2899950  38302   38302        0         21    252080900   0.004    60       15       7020     277   10006   10797   12212     13396    21041    2747
  14  2899950  37996   37996        0         21    250511120   0.004    60       16       7031     283    9956   10775   12197     13425    20465    2681
  15  2899950  38538   38538        0        115    257896120   0.004    60       16       6952     264   10029   10885   12744    103169   135354    2823
  • for the same instance under some load (by running apache2 and having the machine receive single-threaded requests from another instance):
ubuntu@ip-172-31-47-43:~/osjitter$ ./osjitter -t 60
 CPU  TSC_khz  #intr  #delta  ovfl_ns  invol_ctx  sum_intr_ns  iratio  rt_s  loop_ns  median_ns  p20_ns  p80_ns  p90_ns  p99_ns  p99.9_ns   max_ns  mad_ns
   0  2899950 164697  164697        0      45616   2207874929   0.037    60       16       5786     311   15089   28494   94738    137437   241586    5685
   1  2899950 165528  165528        0      47215   2312898886   0.039    60       16       6103    3107   15472   29558   96614    143726  6486111    5979
   2  2899950  38403   38403        0         21    202904824   0.003    60       15       3413     124    9936   10241   11595     13884    66468    3319
   3  2899950  39053   39053        0         21    201559588   0.003    60       15       3344     102    9880   10200   11443     13328   359140    3249
   4  2899950  39038   39038        0         21    201435427   0.003    60       16       3355     101    9838   10159   11441     13065   478406    3261
   5  2899950  39304   39304        0         21    224852478   0.004    60       15       3494     260   10084   10462   19995     27543   595075    3400
   6  2899950  37928   37928        0         28    203587076   0.003    60       15       3413     238    9942   10256   11615     13918   770577    3319
   7  2899950 221249  221249        0      59233   2802399816   0.047    60       16       5553    3633   15727   26450   93456    138799   667532    3152
   8  2899950 231570  231570        0      59791   2851621959   0.048    60       16       5598    3646   15207   26436   95482    136636  1825883    4600
   9  2899950 230325  230325        0      59051   2852771051   0.048    60       16       5646    3620   15342   26587   96741    141290   565293    4793
  10  2899950  38407   38407        0         21    206791609   0.003    60       15       3503     123   10063   10362   11978     14589   125733    3411
  11  2899950  39059   39059        0         21    206951602   0.003    60       15       3464      99   10055   10356   11990     14492   419261    3371
  12  2899950  39043   39043        0         21    205993288   0.003    60       16       3463      98   10033   10345   11796     14330   182679    3372
  13  2899950  38231   38231        0         21    208939084   0.003    60       15       3524     126   10138   10422   13270     14723   536301    3431
  14  2899950  37931   37931        0         21    206312436   0.003    60       15       3517     237   10022   10323   11835     14135   712267    3425
  15  2899950  74110   74110        0      10070    610446950   0.010    60       16       3174      90   12202   15280   82161    134307  2998068    3085

The only metric which this tool reports as consistently low for isolated cores is the number of involuntary context switches (invol_ctx). The difference in interruptions only becomes visible once the system is under some load.

My best guess so far as to why this happens is that it has something to do with the virtualized nature of the instances and I’d be happy if anyone had a more precise explanation of this.

Conclusion

In this article we have looked into what is involved in reducing jitter induced by the operating system on Linux systems, which is to say:

  • picking the correct cores to dedicate to the task
  • configuring the kernel to move housekeeping noise (nohz_full) and interrupts (irqaffinity) away from these cores
  • isolating the cores (isolcpus)

And remember: don’t use core zero.