As part of VMware's performance regression testing for upstream Linux kernel releases, we compared Linux kernel 4.19-rc4 against 4.18 GA and observed both latency improvements of up to 60% and CPU cost regressions of up to 23% in our storage tests. Details can be found in the table below. After performing a bisect between 4.18 GA and 4.19-rc4, we identified the root cause of this behavior to be the change that switched the SCSI stack from a single-queue to a multi-queue model. The details of the change are:

##################
scsi: core: switch to scsi-mq by default

It has been more than one year since we tried to change the default from
legacy to multi queue in SCSI with commit c279bd9e406 ("scsi: default to
scsi-mq"). But due to issues with suspend/resume and performance problems
it had been reverted again with commit cbe7dfa26eee ("Revert "scsi: default
to scsi-mq"").

In the meantime there have been a substantial amount of performance
improvements and suspend/resume got fixed as well, thus we can re-enable
scsi-mq without a significant performance penalty.

Author    : Johannes Thumshirn <mailto:jthumshirn@xxxxxxx>  2018-07-04 10:53:56 +0200
Committer : Martin K. Petersen <mailto:martin.petersen@xxxxxxxxxx>  2018-07-10 22:42:47 -0400

For more details, refer to this link: http://url/ragk

Change hash: d5038a13eca72fb216c07eb717169092e92284f1
##################

1. Test Environment

Below are the details of our test environment:

ESX: vSphere 6.7 GA
GOS: RHEL 7.5
VM type: single VM with 8 vDisks
vSCSI controllers: lsisas
Kernel: 4.18 GA and 4.19-rc4
Backend device 1: local SATA SSD (exposed through a P420 controller)
Backend device 2: FC-8G (connected to an EMC VNX 5100 array)
Benchmark: ioblazer
Block sizes: 4k & 64k
Access patterns: sequential read & sequential write
OIO: 16 OIO/vDisk (16 * 8 = 128 OIO)
Metrics: throughput (IOPS), latency (ms) & CPU cost (CPIO - cycles per I/O)

2. Test Execution

We created a RHEL 7.5 VM and attached 8 data disks (vDisks), either as RDMs ("raw" disks) on the FC SAN or as VMDKs on the local SSD. After running the tests on the Linux 4.18 GA kernel to establish baseline results, we rebooted the VM and upgraded to the Linux 4.19-rc4 kernel. We then re-ran the ioblazer benchmark with the configurations above and measured the throughput, latency, and CPU cost of sequential reads & writes at 4k & 64k block sizes.
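As a sanity check between runs, the I/O path in effect inside the guest can be confirmed through sysfs. Below is a minimal sketch of such a check (our own illustration, not part of the benchmark harness); it assumes the kernel exposes the scsi_mod use_blk_mq module parameter:

    #!/usr/bin/env python3
    # Minimal sketch: report whether the guest SCSI stack is running in
    # multi-queue (scsi-mq) or legacy single-queue mode. Assumes the
    # standard sysfs path for the scsi_mod module parameter.
    from pathlib import Path

    param = Path("/sys/module/scsi_mod/parameters/use_blk_mq")
    if param.exists():
        value = param.read_text().strip()
        print(f"scsi_mod.use_blk_mq = {value}")  # 'Y' => scsi-mq, 'N' => legacy
    else:
        print("use_blk_mq parameter not exposed; check the kernel config")

Note that on 4.19-rc4 the legacy single-queue path still exists, so the new default can be overridden at boot with scsi_mod.use_blk_mq=N.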
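For reference, CPIO (cycles per I/O) normalizes CPU consumption by the delivered I/O rate, so it remains comparable across runs with different throughput. The sketch below shows the general idea; the exact cycle accounting used by our harness may differ, and the numbers are illustrative, not taken from the runs below:

    # Minimal sketch of the CPIO (cycles per I/O) idea; lower is better.
    def cycles_per_io(busy_fraction: float, core_hz: float,
                      n_cores: int, iops: float) -> float:
        """CPU cycles consumed per I/O completed."""
        return (busy_fraction * core_hz * n_cores) / iops

    # Illustrative only: 8 cores at 2.0 GHz, 30% busy, 75000 IOPS
    print(round(cycles_per_io(0.30, 2.0e9, 8, 75000)))  # -> 64000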
3. Performance Results

The following are the performance numbers, for the tests executed on the local SSD, comparing the previous change hash fc21ae8927f391b6e3944f82e417355da5d06a83 (shown as Hash A) with Johannes' change hash d5038a13eca72fb216c07eb717169092e92284f1 (shown as Hash B).

-------------------------------------------------------------------------------
Test Name        Metric         Hash A       Hash B       Difference (in %)
-------------------------------------------------------------------------------
4k seq read      cpucost        64411        65527        -1.73
                 latency        0.563        0.4511       24.82
                 throughput     164342       161167       -1.97
4k seq write     cpucost        68199        73034        -7.09
                 latency        0.5147       0.399        28.92
                 throughput     181057       181634        0.31
64k seq read     cpucost        86799        106143      -22.28
                 latency        1.436        0.902        59.16
                 throughput     78573        78741         0.21
64k seq write    cpucost        85403        101037      -18.3
                 latency        2.407        1.494        61.1
                 throughput     48565        48582         0.03
-------------------------------------------------------------------------------

Note:
- For cpucost (cycles/IO) and latency (ms), lower is better; for throughput (IOPS), higher is better.
- We executed the above tests for 5 iterations each on both the previous change (Hash A) and the problem change (Hash B) and got consistent numbers across iterations.
- The performance data above is with the local SSD as the backend device, where we see latency improvements of up to 60% and CPU cost regressions of up to 23%.
- With the FC SAN as the backend device, the magnitude of the differences is slightly smaller: latency improvements of up to 40% and CPU cost regressions of up to 13%.

4. Conclusions

The results indicate very significant latency improvements at the cost of additional CPU consumption, and this behavior is most visible for large sequential read/write operations (e.g., a 60% latency improvement combined with a 23% increase in CPU consumption for 64k sequential reads). This can be seen as expected behavior: the change enables more parallelism in the storage stack, which inherently allows more CPU cycles to be consumed. Assuming that in most deployments (be it bare metal or virtual) administrators or automated resource-management tools will strive to keep CPU utilization at or below 70%, this change should be seen as a significant improvement for most customers. An exception could be non-customer-controlled cloud environments, where the additional cycles might not be available and the latency improvements might therefore not be achievable.

Rajender M
Performance Engineering
VMware, Inc.