Re: About scsi device queue depth

On Tue, 2021-01-12 at 10:27 +0000, John Garry wrote:
> > > For this case, it seems the opposite - less is more. And I seem
> > > to be hitting closer to the sweet spot there, with more merges.
> > 
> > I think cheaper SSDs have a write latency problem due to erase
> > block issues.  I suspect all SSDs have a channel problem in that
> > there's a certain number of parallel channels and once you go over
> > that number they can't actually work on any more operations even if
> > they can queue them.  For cheaper (as in fewer channels, and less
> > spare erased block capacity) SSDs there will be a benefit to
> > reducing the depth to some multiplier of the channels (I'd guess 2-
> > 4 as the multiplier).  When SSDs become write throttled, there may
> > be less benefit to us queueing in the block layer (merging produces
> > bigger packets with lower overhead, but the erase block consumption
> > will remain the same).
> > 
> > For the record, the internet thinks that cheap SSDs have 2-4
> > channels, so that would argue a tag depth somewhere from 4-16
> 
> I have seen up to 10-channel devices mentioned as being "high end" -
> this would mean up to a queue depth of 40 using a 4x multiplier; so,
> based on that, the current value of 254 for that driver seems way off.
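For concreteness, that arithmetic works out as below. The channel counts and the 2-4x multiplier are guesses from this thread, not anything from a datasheet:

```python
# Rough queue-depth guess: channels * multiplier.  Both inputs are
# speculation from this thread, not vendor-published numbers.
def guess_depth(channels, multiplier):
    return channels * multiplier

# Cheap SSD: 2-4 channels at 2-4x -> depth somewhere in 4..16.
print(guess_depth(2, 2), guess_depth(4, 4))   # 4 16

# "High end" 10-channel device at 4x -> 40, far below the
# driver default of 254 discussed here.
print(guess_depth(10, 4))                     # 40
```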

SSD manufacturers don't want us second guessing their device internals,
which is why they mostly don't publish the details.  They want to move
us to a place where we don't do any merging at all and just spray all
the I/O packets at the device and let it handle them.

Your study argues they still aren't actually in a place where reality
matches their rhetoric, but if you code latency heuristics based on my
guesses they'll likely be wrong for the next generation of devices.

> > > > SSDs have a peculiar lifetime problem in that when they get
> > > > erase block starved they start behaving more like spinning rust
> > > > in that they reach a processing limit but only for writes, so
> > > > lowering the write queue depth (which we don't even have a knob
> > > > for) might be a good solution.  Trying to track the erase block
> > > > problem has been a constant bugbear.
> > > 
> > > I am only doing read performance test here, and the disks are
> > > SAS3.0 SSDs HUSMM1640ASS204, so not exactly slow.
> > 
> > Possibly ... the stats on most manufacturer SSDs don't give you
> > information about the channels or spare erase blocks.
> 
> For my particular disk, this is the datasheet/manual:
> https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-sas-series/data-sheet-ultrastar-ssd1600ms.pdf
> 
> https://documents.westerndigital.com/content/dam/doc-library/en_us/assets/public/western-digital/product/data-center-drives/ultrastar-sas-series/product-manual-ultrastar-ssd1600mr-1-92tb.pdf
> 
> And I didn't see explicit info regarding channels or spare erase
> blocks, as you expect.

Right, manufacturers simply aren't going to give us that information.
I also suspect the device won't return queue full usefully, so we
can't track that meaningfully either.

Remember also that all SSDs have a flash translation layer which
transforms our continuous I/O into a scatter/gather list.  This does
argue there are diminishing returns from merging I/Os anyway and
supports the manufacturer argument that we shouldn't be doing merging.

I think the question for us is: if we do discover additional latency in
an SSD read/write queue, what do we do with it?  For spinning rust,
using spare latency to merge requests is a definite win because the
drive machinery is way more efficient with contiguous reads/writes.
What would be a similar win for SSDs?

> > > > I'm assuming you're using spinning rust in the above, so it
> > > > sounds like the firmware in the card might be eating the queue
> > > > full returns.  I could see this happening in RAID mode, but it
> > > > shouldn't happen in jbod mode.
> > > 
> > > Not sure on that, but I didn't check too much. I did try to
> > > increase fio queue depth and sdev queue depth to be very large to
> > > clobber the disks, but still nothing.
> > 
> > If it's an SSD it's likely not giving the queue full you'd need to
> > get the mid-layer to throttle automatically.
> > 
> 
> So it seems that the queue depth we select should depend on class of 
> device, but then the value can also affect write performance.

Well, it does today.  Queue full works for most non-SSD devices and it
allows us to set a useful queue depth.  We also have the norotational
flag in the block layer to help.

> As for my issue today, I can propose a smaller value for the mpt3sas 
> driver based on my limited tests, and see how the driver maintainers 
> feel about it.

It will run counter to the SSD manufacturers' "just give us all your
packets ASAP" mantra, so most commercial driver vendors won't want it
changed.

> I just wonder what intelligence we can add for this. And whether
> LLDDs should be selecting this (queue depth) at all, unless they (the
> HBA) have some limits themselves.

I suspect it's not in the block layer at all.  I'd guess the best thing
we can do is report on read and write latency and let something in
userspace see if it wants to adjust the queue depth.
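A sketch of what that userspace piece might look like.  The AIMD-style policy, the thresholds, and the helper names are all invented for illustration; the only real thing here is the sysfs knob itself (/sys/block/<dev>/device/queue_depth):

```python
# Sketch of a userspace queue-depth tuner: watch completion latency
# and nudge the sdev queue depth via sysfs.  Policy and thresholds
# are made up; this is not an existing tool.

def next_depth(depth, lat_ms, target_ms=2.0, floor=4, ceil=254):
    """Pure policy: multiplicative decrease, additive increase."""
    if lat_ms > target_ms:
        depth //= 2          # latency too high: back off hard
    else:
        depth += 1           # latency fine: probe upward slowly
    return max(floor, min(ceil, depth))

def apply_depth(dev, depth):
    # The real knob such a tuner would write (requires root).
    with open(f"/sys/block/{dev}/device/queue_depth", "w") as f:
        f.write(str(depth))
```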

> You did mention maybe a separate write queue depth - could this be a 
> solution?

I was observing that the gating factors for the read and write latency
characteristics are radically different for SSDs, so I would really
expect optimal queue depths to be different on read and write requests.
Now whether there's a benefit to using that latency is a different
question.  If we had a read and a write queue depth, what would we do
with the latency in the write queue?  The problem on writes is the
device erasing blocks ... merging requests doesn't do anything to help
with that problem, so there may be no real benefit to reducing the
write queue depth.
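A toy model of that last point.  All the constants are arbitrary, but it shows why merging cuts per-command overhead while leaving erase-block consumption untouched:

```python
# Merging N small writes into one big write reduces per-request
# overhead, but the bytes hitting flash (and hence erase-block wear)
# are identical either way.  Constants are made up for illustration.

ERASE_BLOCK = 256 * 1024          # bytes per erase block (assumed)
PER_REQUEST_OVERHEAD_US = 10      # fixed cost per command (assumed)

def cost(writes):
    """Return (erase blocks consumed, total command overhead in us)."""
    total = sum(writes)
    blocks = -(-total // ERASE_BLOCK)   # ceiling division
    return blocks, len(writes) * PER_REQUEST_OVERHEAD_US

unmerged = [4096] * 64            # 64 separate 4 KiB writes
merged   = [4096 * 64]            # same data, one merged request

# Same erase-block consumption, 64x less command overhead.
print(cost(unmerged))   # (1, 640)
print(cost(merged))     # (1, 10)
```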

James




