RE: [PATCH v1] scsi: storvsc: Parameterize nr_hw_queues

From: Melanie Plageman <melanieplageman@xxxxxxxxx> Sent: Tuesday, February 2, 2021 12:26 PM
> 
> Proposed patch attached.
>

For public mailing lists like linux-hyperv@xxxxxxxxxxxxxxx that are
associated with the Linux kernel project, proposed patches should
generally go inline in the email, rather than as an attachment.

> 
> While doing some performance tuning of various block device parameters
> on Azure to optimize database workloads, we found that reducing the
> number of hardware queues (nr_hw_queues, used by the block-mq layer and
> by storvsc) improved the performance of short-queue-depth writes, such as
> database journaling.
> 
> The Azure premium disks we were testing on often had around 2-3ms
> latency for a single small write. Given the IOPS and bandwidth
> provisioned on a typical Azure VM paired with an Azure premium disk,
> we found that substantially decreasing the LUN queue_depth from its
> default of 2048 was often required to get reasonable IOPS out of
> small sequential synchronous writes issued at an I/O depth of 1. In our
> investigation, the high default queue_depth resulted in such a large
> number of outstanding requests against the device that a journaling
> write incurred very high latency, slowing overall database performance.
> 
> However, even with tuning these defaults (including
> /sys/block/sdX/queue/nr_requests), we still incurred high latency,
> especially on machines with high core counts, as the number of hardware
> queues, nr_hw_queues, is set to be the number of CPUs in storvsc.
> nr_hw_queues is, in turn, used to calculate the number of tags available
> on the block device, meaning that we still ended up with deeper queues
> than intended when requests were submitted to more than one hardware
> queue.
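> 
> As a rough illustration (using the 32-vCPU VM and nr_requests=55 from
> the first example below, and assuming, as described above, that tags
> are allocated per hardware queue):
> 
>   intended queue depth for the device:      55
>   hardware queues (one per CPU):            32
>   worst-case requests in flight:       32 * 55 = 1760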
> 
> We calculated the optimal block device settings, including our intended
> queue depth, to utilize as much of the provisioned bandwidth as possible
> while still getting a reasonable number of IOPS from low queue depth
> sequential IO and random IO, but, without parameterizing the number of
> hardware queues, we weren't able to fully control this queue_depth.
> 
> Attached is a patch which adds a module param to control nr_hw_queues.
> The default is the current value (number of CPUs), so it should have no
> impact on users who do not set the param.
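> 
> As a minimal sketch of the approach (illustrative only; the exact
> names and placement may differ slightly in the attached patch):
> 
>   /* 0 (the default) keeps the current behavior of one queue per CPU */
>   static int storvsc_nr_hw_queues;
>   module_param(storvsc_nr_hw_queues, int, 0444);
>   MODULE_PARM_DESC(storvsc_nr_hw_queues, "Number of hardware queues");
> 
>   /* ... in storvsc_probe(), where nr_hw_queues is currently set ... */
>   if (storvsc_nr_hw_queues > 0)
>           host->nr_hw_queues = storvsc_nr_hw_queues;
>   else
>           host->nr_hw_queues = num_present_cpus();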
> 
> As a minimal example of the potential benefit, we found that, starting
> with a baseline of optimized block device tunings, we could
> substantially increase the IOPS of a sequential write job for both a
> large Azure premium SSD and a small Azure premium SSD.
> 
> On an Azure Standard_F32s_v2 with a single 16 TiB disk, which has a
> provisioned max bandwidth of 750 MB/s and provisioned max IOPS of
> 18000, running Debian 10 with a Linux kernel built from master
> (v5.11-rc6 at the time of patch testing) with the patch applied, and
> with the following block device settings:
> 
> /sys/block/sdX/device/queue_depth=55
> 
> /sys/block/sdX/queue/max_sectors_kb=64,
>                      read_ahead_kb=2296,
>                      nr_requests=55,
>                      wbt_lat_usec=0,
>                      scheduler=mq-deadline
> 
> And this fio job file:
>   [global]
>   time_based=1
>   ioengine=libaio
>   direct=1
>   runtime=20
> 
>   [job1]
>   name=seq_read
>   bs=32k
>   size=23G
>   rw=read
>   numjobs=2
>   iodepth=110
>   iodepth_batch_submit=55
>   iodepth_batch_complete=55
> 
>   [job2]
>   name=seq_write
>   bs=8k
>   size=10G
>   rw=write
>   numjobs=1
>   iodepth=1
>   overwrite=1
> 
> With ncpu hardware queues configured, we measured an average of 764 MB/s
> read throughput and 153 write IOPS.
> 
> With one hardware queue configured, we measured an average of 763 MB/s
> read throughput and 270 write IOPS.
> 
> And on an Azure Standard_F32s_v2 with a single 16 GiB disk, a
> combination with a provisioned max bandwidth of 170 MB/s and a
> provisioned max IOPS of 3500, and with the following block device settings:
> 
> /sys/block/sdX/device/queue_depth=11
> 
> /sys/block/sdX/queue/max_sectors_kb=65,
>                      read_ahead_kb=520,
>                      nr_requests=11,
>                      wbt_lat_usec=0,
>                      scheduler=mq-deadline
> 
> And with this fio job file:
>   [global]
>   time_based=1
>   ioengine=libaio
>   direct=1
>   runtime=60
> 
>   [job1]
>   name=seq_read
>   bs=32k
>   size=5G
>   rw=read
>   numjobs=2
>   iodepth=22
>   iodepth_batch_submit=11
>   iodepth_batch_complete=11
> 
>   [job2]
>   name=seq_write
>   bs=8k
>   size=3G
>   rw=write
>   numjobs=1
>   iodepth=1
>   overwrite=1
> 
> With ncpu hardware queues configured, we measured an average of 123 MB/s
> read throughput and 56 write IOPS.
> 
> With one hardware queue configured, we measured an average of 165 MB/s
> read throughput and 346 write IOPS.
> 
> Providing this option as a module param will help improve performance of
> certain workloads on certain devices.

I'm in agreement that the current handling of I/O queuing in the storvsc
driver has problems.  Your data definitely confirms that, and there are other
data points that indicate that we need to more fundamentally rethink
what I/Os get queued where.  Storvsc is letting far too many I/Os get
queued in the VMbus ring buffers and in the underlying Hyper-V.

Adding a module parameter to specify the number of hardware queues
may be part of the solution.  But I really want to step back a bit and
take into account all the data points we have before deciding what to
change, what additional parameters to offer (if any), etc.  There are
other ways of limiting the number of I/Os being queued at the driver
level, and I'm wondering how those trade off against adding a module
parameter.  I'm planning to jump in on this topic in just a few weeks,
and would like to coordinate with you.

> 
> In the attached patch, I check that the value provided for
> storvsc_nr_hw_queues is within a valid range at init time and error out
> if it is not. I noticed this warning from scripts/checkpatch.pl:
> 
>   WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then
>     dev_err(dev, ... then pr_err(...  to printk(KERN_ERR ...
>   #64: FILE: drivers/scsi/storvsc_drv.c:2183:
>   printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.
> 
> Should I be using a different function for printing this message?

Yes.  If you look at other code in storvsc_drv.c, you'll see the use of
"dev_err" for outputting error messages.  Follow the pattern of
these other uses of "dev_err", and that should eliminate the
checkpatch warning.
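
For example, a rough sketch along those lines (untested, and assuming
the validation moves into storvsc_probe(), where a struct hv_device is
available):

  /* In storvsc_probe(), before host->nr_hw_queues is set */
  if (storvsc_nr_hw_queues < 0 ||
      storvsc_nr_hw_queues > num_present_cpus()) {
          dev_err(&device->device,
                  "Invalid storvsc_nr_hw_queues value of %d\n",
                  storvsc_nr_hw_queues);
          /* fail the probe (or jump to the existing error path) */
          return -EINVAL;
  }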

Michael

> 
> Regards,
> Melanie (Microsoft)



