Re: About scsi device queue depth

James Bottomley <jejb@xxxxxxxxxxxxx> · Mon, 11 Jan 2021 22:35:40 -0800

On Mon, 2021-01-11 at 17:11 +0000, John Garry wrote:
> On 11/01/2021 16:40, James Bottomley wrote:
> > > So initial sdev queue depth comes from cmd_per_lun by default or
> > > manually setting in the driver via scsi_change_queue_depth(). It
> > > seems to me that some drivers are not setting this optimally, as
> > > above.
> > > 
> > > Thoughts on guidance for setting sdev queue depth? Could blk-mq
> > > have changed this behavior?
> 
> Hi James,
> 
> > In general, for spinning rust, you want the minimum queue depth
> > possible for keeping the device active because merging is a very
> > important performance enhancement and once the drive is fully
> > occupied simply sending more tags won't improve latency.  We used
> > to recommend a depth of about 4 for this reason.  A co-operative
> > device can help you find the optimal by returning QUEUE_FULL when
> > it's fully occupied so we have a mechanism to track the queue full
> > returns and change the depth interactively.
> > 
> > For high iops devices, these considerations went out of the window
> > and it's generally assumed (without verifying evidence) the more tags
> > the better. 
> 
> For this case, it seems the opposite - less is more. And I seem to
> be hitting closer to the sweet spot there, with more merges.

I think cheaper SSDs have a write latency problem due to erase block
issues.  I suspect all SSDs have a channel problem in that there's a
certain number of parallel channels and once you go over that number
they can't actually work on any more operations even if they can queue
them.  For cheaper SSDs (as in fewer channels and less spare erased-block
capacity) there will be a benefit to reducing the depth to some small
multiple of the channel count (I'd guess 2-4 as the multiplier).  When
SSDs become write throttled, there may be less benefit to us queueing
in the block layer (merging produces bigger packets with lower
overhead, but the erase block consumption will remain the same).

For the record, the internet thinks that cheap SSDs have 2-4 channels,
so that would argue for a tag depth somewhere from 4-16.
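
The arithmetic behind that range, as a small Python sketch.  The
function name depth_range is made up for illustration, and the channel
counts and multipliers are just the guesses above, not measured values:

```python
def depth_range(channels, multipliers):
    """Return (lowest, highest) candidate queue depth.

    Each candidate is channel count x multiplier, following the
    rule of thumb that depth should be a small multiple of the
    number of parallel channels in the SSD.
    """
    candidates = [c * m for c in channels for m in multipliers]
    return min(candidates), max(candidates)


# Guessed inputs: 2-4 channels, multiplier of 2-4.
low, high = depth_range(channels=(2, 4), multipliers=(2, 4))
print(low, high)  # 4 16
```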

> > SSDs have a peculiar lifetime problem in that when they get
> > erase block starved they start behaving more like spinning rust in
> > that they reach a processing limit but only for writes, so lowering
> > the write queue depth (which we don't even have a knob for) might
> > be a good solution.  Trying to track the erase block problem has
> > been a constant bugbear.
> 
> I am only doing read performance tests here, and the disks are SAS3.0
> SSDs HUSMM1640ASS204, so not exactly slow.

Possibly ... the stats on most manufacturers' SSDs don't give you
information about the channels or spare erase blocks.

> > I'm assuming you're using spinning rust in the above, so it sounds
> > like the firmware in the card might be eating the queue full
> > returns.  I could see this happening in RAID mode, but it shouldn't
> > happen in jbod mode.
> 
> Not sure on that, but I didn't check too much. I did try to increase
> fio queue depth and sdev queue depth to be very large to clobber the
> disks, but still nothing.

If it's an SSD, it's likely not returning the QUEUE_FULL status the
mid-layer would need in order to throttle automatically.
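
To illustrate why the feedback matters, here is a toy model in Python.
It is not the mid-layer's actual tracking code, only a sketch of the
idea; settle_depth, device_capacity and reports_queue_full are names
invented for the example, and the capacity of 8 is arbitrary:

```python
def settle_depth(initial_depth, device_capacity, reports_queue_full,
                 rounds=10):
    """Toy model of depth throttling driven by QUEUE_FULL feedback.

    Each round the host tries to keep `depth` commands outstanding.
    The device can only take `device_capacity` of them.  If it says
    so (QUEUE_FULL), the host lowers its depth to what was accepted;
    if it stays silent, the host never learns and the depth is left
    where it started.
    """
    depth = initial_depth
    for _ in range(rounds):
        accepted = min(depth, device_capacity)
        if accepted < depth and reports_queue_full:
            depth = max(1, accepted)
    return depth


print(settle_depth(64, 8, reports_queue_full=True))   # 8
print(settle_depth(64, 8, reports_queue_full=False))  # 64
```

With a co-operative device the depth converges on what the device can
really handle; without the feedback it has to be tuned by hand.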

James