Proposed patch attached. While doing performance tuning of various block device parameters on Azure to optimize database workloads, we found that reducing the number of hardware queues (nr_hw_queues, used by the block-mq layer and by storvsc) improved the performance of short-queue-depth writes, such as database journaling. The Azure premium disks we were testing often had around 2-3 ms of latency for a single small write. Given the IOPS and bandwidth provisioned on a typical Azure VM paired with an Azure premium disk, we found that substantially decreasing the LUN queue_depth from its default of 2048 was often required to get reasonable IOPS out of small sequential synchronous writes issued at a queue depth of 1. In our investigation, the high default queue_depth resulted in so many outstanding requests against the device that a journaling write incurred very high latency, slowing overall database performance.

However, even after tuning these defaults (including /sys/block/sdX/queue/nr_requests), we still saw high latency, especially on machines with high core counts, because storvsc sets the number of hardware queues, nr_hw_queues, to the number of CPUs. nr_hw_queues is, in turn, used to calculate the number of tags available on the block device, so we still ended up with deeper queues than intended whenever requests were submitted to more than one hardware queue. We calculated the optimal block device settings, including our intended queue depth, to use as much of the provisioned bandwidth as possible while still getting a reasonable number of IOPS from low-queue-depth sequential and random IO, but without parameterizing the number of hardware queues we were not able to fully control this queue depth.

Attached is a patch which adds a module param to control nr_hw_queues. The default is the current value (the number of CPUs), so it should have no impact on users who do not set the param. A rough sketch of the shape of the change is included after the first set of results below.

As a minimal example of the potential benefit, starting from a baseline of optimized block device tunings, we could substantially increase the IOPS of a sequential write job for both a large and a small Azure premium SSD.

On an Azure Standard_F32s_v2 with a single 16 TiB disk (provisioned max bandwidth 750 MB/s, provisioned max IOPS 18000), running Debian 10 with a Linux kernel built from master (v5.11-rc6 at the time of patch testing) with the patch applied, we used the following block device settings:

    /sys/block/sdX/device/queue_depth=55
    /sys/block/sdX/queue/max_sectors_kb=64
    /sys/block/sdX/queue/read_ahead_kb=2296
    /sys/block/sdX/queue/nr_requests=55
    /sys/block/sdX/queue/wbt_lat_usec=0
    /sys/block/sdX/queue/scheduler=mq-deadline

and this fio job file:

    [global]
    time_based=1
    ioengine=libaio
    direct=1
    runtime=20

    [job1]
    name=seq_read
    bs=32k
    size=23G
    rw=read
    numjobs=2
    iodepth=110
    iodepth_batch_submit=55
    iodepth_batch_complete=55

    [job2]
    name=seq_write
    bs=8k
    size=10G
    rw=write
    numjobs=1
    iodepth=1
    overwrite=1

We measured the following averages:

    ncpu hardware queues: 764 MB/s read throughput, 153 write IOPS
    one hardware queue:   763 MB/s read throughput, 270 write IOPS
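For readers who would rather not open the attachment, the core of the change is shaped roughly as follows. This is only a sketch, not the patch verbatim: the -1 "unset" sentinel, the 0444 permissions, and the parameter description are illustrative, and num_present_cpus() is shown because that is how the driver derives the current ncpu default.

    /* module scope, alongside the existing storvsc module params */
    static int storvsc_nr_hw_queues = -1;
    module_param(storvsc_nr_hw_queues, int, 0444);
    MODULE_PARM_DESC(storvsc_nr_hw_queues, "Number of hardware queues");

    /* in storvsc_probe(), where nr_hw_queues is currently set to ncpu */
    if (storvsc_nr_hw_queues > 0)
            host->nr_hw_queues = storvsc_nr_hw_queues;
    else
            host->nr_hw_queues = num_present_cpus();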
We repeated the comparison on an Azure Standard_F32s_v2 with a single 16 GiB disk, a combination with a provisioned max bandwidth of 170 MB/s and a provisioned max IOPS of 3500, using these block device settings:

    /sys/block/sdX/device/queue_depth=11
    /sys/block/sdX/queue/max_sectors_kb=65
    /sys/block/sdX/queue/read_ahead_kb=520
    /sys/block/sdX/queue/nr_requests=11
    /sys/block/sdX/queue/wbt_lat_usec=0
    /sys/block/sdX/queue/scheduler=mq-deadline

and this fio job file:

    [global]
    time_based=1
    ioengine=libaio
    direct=1
    runtime=60

    [job1]
    name=seq_read
    bs=32k
    size=5G
    rw=read
    numjobs=2
    iodepth=22
    iodepth_batch_submit=11
    iodepth_batch_complete=11

    [job2]
    name=seq_write
    bs=8k
    size=3G
    rw=write
    numjobs=1
    iodepth=1
    overwrite=1

We measured the following averages:

    ncpu hardware queues: 123 MB/s read throughput, 56 write IOPS
    one hardware queue:   165 MB/s read throughput, 346 write IOPS

Providing this option as a module param will help improve performance of certain workloads on certain devices.

In the attached patch, I check that the value provided for storvsc_nr_hw_queues is within a valid range at init time and error out if it is not. I noticed this warning from scripts/checkpatch.pl:

    WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then dev_err(dev, ... then pr_err(... to printk(KERN_ERR ...
    #64: FILE: drivers/scsi/storvsc_drv.c:2183:
    printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.

Should I be using a different function for printing this message?

Regards,
Melanie (Microsoft)
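P.S. For context on the checkpatch question, the init-time range check is shaped roughly like this (again a sketch rather than the patch verbatim; the exact bounds, error value, and message text are in the attachment):

    /* in storvsc_drv_init(), before registering the driver */
    if (storvsc_nr_hw_queues != -1 &&
        (storvsc_nr_hw_queues < 1 ||
         storvsc_nr_hw_queues > num_present_cpus())) {
            /* message shortened here; full text is in the patch */
            printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.\n",
                   storvsc_nr_hw_queues);
            return -EINVAL;
    }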
Attachment:
v1-0001-scsi-storvsc-Parameterize-number-hardware-queues.patch
Description: Binary data