Proposed patch attached. While doing performance tuning of various block device parameters on Azure to optimize database workloads, we found that reducing the number of hardware queues (nr_hw_queues, used by the block-mq layer and by storvsc) improved the performance of short-queue-depth writes, such as database journaling. The Azure premium disks we were testing often had around 2-3 ms of latency for a single small write. Given the IOPS and bandwidth provisioned on a typical Azure VM paired with an Azure premium disk, we found that substantially decreasing the LUN queue_depth from its default of 2048 was often required to get reasonable IOPS out of small sequential synchronous writes issued at a queue depth of 1. In our investigation, the high default queue_depth resulted in so many outstanding requests against the device that a journaling write incurred very high latency, slowing overall database performance.

However, even after tuning these defaults (including /sys/block/sdX/queue/nr_requests), we still saw high latency, especially on machines with high core counts, because storvsc sets the number of hardware queues, nr_hw_queues, to the number of CPUs. nr_hw_queues is, in turn, used to calculate the number of tags available on the block device, so we still ended up with deeper queues than intended whenever requests were submitted to more than one hardware queue. We calculated the optimal block device settings, including our intended queue depth, to use as much of the provisioned bandwidth as possible while still getting a reasonable number of IOPS from low-queue-depth sequential and random IO, but without parameterizing the number of hardware queues we were not able to fully control this queue depth.

Attached is a patch which adds a module param to control nr_hw_queues. The default is the current value (the number of CPUs), so it should have no impact on users who do not set the param. A rough sketch of the shape of the change is included after the first set of results below.

As a minimal example of the potential benefit, starting from a baseline of optimized block device tunings, we could substantially increase the IOPS of a sequential write job for both a large and a small Azure premium SSD.

On an Azure Standard_F32s_v2 with a single 16 TiB disk (provisioned max bandwidth 750 MB/s, provisioned max IOPS 18000), running Debian 10 with a Linux kernel built from master (v5.11-rc6 at the time of patch testing) with the patch applied, we used the following block device settings:

    /sys/block/sdX/device/queue_depth=55
    /sys/block/sdX/queue/max_sectors_kb=64
    /sys/block/sdX/queue/read_ahead_kb=2296
    /sys/block/sdX/queue/nr_requests=55
    /sys/block/sdX/queue/wbt_lat_usec=0
    /sys/block/sdX/queue/scheduler=mq-deadline

and this fio job file:

    [global]
    time_based=1
    ioengine=libaio
    direct=1
    runtime=20

    [job1]
    name=seq_read
    bs=32k
    size=23G
    rw=read
    numjobs=2
    iodepth=110
    iodepth_batch_submit=55
    iodepth_batch_complete=55

    [job2]
    name=seq_write
    bs=8k
    size=10G
    rw=write
    numjobs=1
    iodepth=1
    overwrite=1

We measured the following averages:

    ncpu hardware queues: 764 MB/s read throughput, 153 write IOPS
    one hardware queue:   763 MB/s read throughput, 270 write IOPS
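For readers who would rather not open the attachment, the core of the change is shaped roughly as follows. This is only a sketch, not the patch verbatim: the -1 "unset" sentinel, the 0444 permissions, and the parameter description are illustrative, and num_present_cpus() is shown because that is how the driver derives the current ncpu default.

    /* module scope, alongside the existing storvsc module params */
    static int storvsc_nr_hw_queues = -1;
    module_param(storvsc_nr_hw_queues, int, 0444);
    MODULE_PARM_DESC(storvsc_nr_hw_queues, "Number of hardware queues");

    /* in storvsc_probe(), where nr_hw_queues is currently set to ncpu */
    if (storvsc_nr_hw_queues > 0)
            host->nr_hw_queues = storvsc_nr_hw_queues;
    else
            host->nr_hw_queues = num_present_cpus();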
We repeated the comparison on an Azure Standard_F32s_v2 with a single 16 GiB disk, a combination with a provisioned max bandwidth of 170 MB/s and a provisioned max IOPS of 3500, using these block device settings:

    /sys/block/sdX/device/queue_depth=11
    /sys/block/sdX/queue/max_sectors_kb=65
    /sys/block/sdX/queue/read_ahead_kb=520
    /sys/block/sdX/queue/nr_requests=11
    /sys/block/sdX/queue/wbt_lat_usec=0
    /sys/block/sdX/queue/scheduler=mq-deadline

and this fio job file:

    [global]
    time_based=1
    ioengine=libaio
    direct=1
    runtime=60

    [job1]
    name=seq_read
    bs=32k
    size=5G
    rw=read
    numjobs=2
    iodepth=22
    iodepth_batch_submit=11
    iodepth_batch_complete=11

    [job2]
    name=seq_write
    bs=8k
    size=3G
    rw=write
    numjobs=1
    iodepth=1
    overwrite=1

We measured the following averages:

    ncpu hardware queues: 123 MB/s read throughput, 56 write IOPS
    one hardware queue:   165 MB/s read throughput, 346 write IOPS

Providing this option as a module param will help improve performance of certain workloads on certain devices.

In the attached patch, I check that the value provided for storvsc_nr_hw_queues is within a valid range at init time and error out if it is not. I noticed this warning from scripts/checkpatch.pl:

    WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then dev_err(dev, ... then pr_err(... to printk(KERN_ERR ...
    #64: FILE: drivers/scsi/storvsc_drv.c:2183:
    printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.

Should I be using a different function for printing this message?

Regards,
Melanie (Microsoft)
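P.S. For context on the checkpatch question, the init-time range check is shaped roughly like this (again a sketch rather than the patch verbatim; the exact bounds, error value, and message text are in the attachment):

    /* in storvsc_drv_init(), before registering the driver */
    if (storvsc_nr_hw_queues != -1 &&
        (storvsc_nr_hw_queues < 1 ||
         storvsc_nr_hw_queues > num_present_cpus())) {
            /* message shortened here; full text is in the patch */
            printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.\n",
                   storvsc_nr_hw_queues);
            return -EINVAL;
    }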
Attachment:
v1-0001-scsi-storvsc-Parameterize-number-hardware-queues.patch
Description: Binary data