As part of VMware's performance regression testing for upstream Linux kernel releases, we compared Linux kernel 4.19-rc4 against 4.18 GA and observed both latency improvements of up to 60% and CPU cost regressions of up to 23% in our storage tests. Details can be found in the table below. After performing a bisect between 4.18 GA and 4.19-rc4, we identified the root cause of this behavior to be the change that switched the SCSI stack from a single-queue to a multi-queue model. The details of the change are:

##################
scsi: core: switch to scsi-mq by default

It has been more than one year since we tried to change the default from
legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
scsi-mq"). But due to issues with suspend/resume and performance problems
it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
to scsi-mq"").

In the meantime there have been a substantial amount of performance
improvements and suspend/resume got fixed as well, thus we can re-enable
scsi-mq without a significant performance penalty.

Author    : Johannes Thumshirn <mailto:jthumshirn@xxxxxxx>  2018-07-04 10:53:56 +0200
Committer : Martin K. Petersen <mailto:martin.petersen@xxxxxxxxxx>  2018-07-10 22:42:47 -0400

For more details, refer to this link: http://url/ragk

Change hash: d5038a13eca72fb216c07eb717169092e92284f1
##################

1. Test Environment

Below are the details of our test environment:

ESX: vSphere 6.7 GA
GOS: RHEL 7.5
VM type: single VM with 8 vDisks
vSCSI controllers: lsisas
Kernel: 4.18 GA and 4.19-rc4
Backend device 1: local SATA SSD (exposed through a P420 controller)
Backend device 2: FC-8G (connected to an EMC VNX 5100 array)
Benchmark: ioblazer
Block sizes: 4k & 64k
Access patterns: sequential read & sequential write
OIO: 16 OIO/vDisk (16 * 8 = 128 OIO)
Metrics: throughput (IOPS), latency (ms) & CPU cost (CPIO - cycles per I/O)

2. Test Execution

We created a RHEL 7.5 VM and attached 8 data disks (vDisks), either as RDMs ("raw" disks) on the FC SAN or as VMDKs on the local SSD. After running the tests on the Linux 4.18 GA kernel to establish baseline results, we rebooted the VM and upgraded to the Linux 4.19-rc4 kernel. We then re-ran the ioblazer benchmark with the configurations above and measured the throughput, latency, and CPU cost of sequential reads & writes at 4k & 64k block sizes.
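As a sanity check between runs, the I/O path in effect inside the guest can be confirmed through sysfs. Below is a minimal sketch of such a check (our own illustration, not part of the benchmark harness); it assumes the kernel exposes the scsi_mod use_blk_mq module parameter:

    #!/usr/bin/env python3
    # Minimal sketch: report whether the guest SCSI stack is running in
    # multi-queue (scsi-mq) or legacy single-queue mode. Assumes the
    # standard sysfs path for the scsi_mod module parameter.
    from pathlib import Path

    param = Path("/sys/module/scsi_mod/parameters/use_blk_mq")
    if param.exists():
        value = param.read_text().strip()
        print(f"scsi_mod.use_blk_mq = {value}")  # 'Y' => scsi-mq, 'N' => legacy
    else:
        print("use_blk_mq parameter not exposed; check the kernel config")

Note that on 4.19-rc4 the legacy single-queue path still exists, so the new default can be overridden at boot with scsi_mod.use_blk_mq=N.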
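For reference, CPIO (cycles per I/O) normalizes CPU consumption by the delivered I/O rate, so it remains comparable across runs with different throughput. The sketch below shows the general idea; the exact cycle accounting used by our harness may differ, and the numbers are illustrative, not taken from the runs below:

    # Minimal sketch of the CPIO (cycles per I/O) idea; lower is better.
    def cycles_per_io(busy_fraction: float, core_hz: float,
                      n_cores: int, iops: float) -> float:
        """CPU cycles consumed per I/O completed."""
        return (busy_fraction * core_hz * n_cores) / iops

    # Illustrative only: 8 cores at 2.0 GHz, 30% busy, 75000 IOPS
    print(round(cycles_per_io(0.30, 2.0e9, 8, 75000)))  # -> 64000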
3. Performance Results

The following are the performance numbers, for the tests executed on the local SSD, comparing the previous change hash fc21ae8927f391b6e3944f82e417355da5d06a83 (shown as Hash A) with Johannes' change hash d5038a13eca72fb216c07eb717169092e92284f1 (shown as Hash B).

-------------------------------------------------------------------------------
Test Name        Metric         Hash A       Hash B       Difference (in %)
-------------------------------------------------------------------------------
4k seq read      cpucost        64411        65527        -1.73
                 latency        0.563        0.4511       24.82
                 throughput     164342       161167       -1.97
4k seq write     cpucost        68199        73034        -7.09
                 latency        0.5147       0.399        28.92
                 throughput     181057       181634        0.31
64k seq read     cpucost        86799        106143      -22.28
                 latency        1.436        0.902        59.16
                 throughput     78573        78741         0.21
64k seq write    cpucost        85403        101037      -18.3
                 latency        2.407        1.494        61.1
                 throughput     48565        48582         0.03
-------------------------------------------------------------------------------

Note:
- For cpucost (cycles/IO) and latency (ms), lower is better; for throughput (IOPS), higher is better.
- We executed the above tests for 5 iterations each on both the previous change (Hash A) and the problem change (Hash B) and got consistent numbers across iterations.
- The performance data above is with the local SSD as the backend device, where we see latency improvements of up to 60% and CPU cost regressions of up to 23%.
- With the FC SAN as the backend device, the magnitude of the differences is slightly smaller: latency improvements of up to 40% and CPU cost regressions of up to 13%.

4. Conclusions

The results indicate very significant latency improvements at the cost of additional CPU consumption, and this behavior is most visible for large sequential read/write operations (e.g., a 60% latency improvement combined with a 23% increase in CPU consumption for 64k sequential reads). This can be seen as expected behavior: the change enables more parallelism in the storage stack, which inherently allows more CPU cycles to be consumed. Assuming that in most deployments (be it bare metal or virtual) administrators or automated resource-management tools will strive to keep CPU utilization at or below 70%, this change should be seen as a significant improvement for most customers. An exception could be non-customer-controlled cloud environments, where the additional cycles might not be available and the latency improvements might therefore not be achievable.

Rajender M
Performance Engineering
VMware, Inc.