Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 21 Nov 2019 09:07:30 +0800

On Wed, Nov 20, 2019 at 12:56:21PM -0800, Bart Van Assche wrote:
> On 11/20/19 9:00 AM, Ewan D. Milne wrote:
> > On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > > I must admit I patently don't like this explicit dependency on
> > > blk_nonrot(). Having a conditional counter is just an open invitation to
> > > getting things wrong...
> > 
> > This concerns me as well, it seems like the SCSI ML should have its
> > own per-device attribute if we actually need to control this per-device
> > instead of on a per-host or per-driver basis.  And it seems like this
> > is something that is specific to high-performance drivers, so changing
> > the way this works for all drivers seems a bit much.
> > 
> > Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> > but I dislike wrapping the examination of that and the queue flag in
> > a macro that makes it not obvious how the behavior is affected.
> > (Plus Hannes just submitted the patches to remove .use_cmd_list,
> > which was another piece of ML functionality used by only a few drivers.)
> > 
> > Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> > the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> > clears it and then calls sd_read_block_characteristics() which may set
> > it again.  So it's not clear to me how reliable this is.
> 
> How about changing the default behavior into ignoring sdev->queue_depth and
> only honoring sdev->queue_depth if a "quirk" flag is set or if overridden by
> setting a sysfs attribute?

Using a 'quirk' was my first idea when we started discussing this issue,
but the problem is that it isn't flexible: for example, the same HBA may be
connected to HDDs in one setup and to SSDs in another.

> My understanding is that the goal of the queue
> ramp-up/ramp-down mechanism is to reduce the number of times a SCSI device
> responds "BUSY".

I don't understand the motivation for ramp-up/ramp-down; maybe it is just
for fairness among LUNs. Currently, blk-mq already provides fair IO
submission among LUNs. One problem with ignoring sdev->queue_depth is that
sequential IO performance may drop a lot compared with before.

> An alternative for queue ramp-up/ramp-down is a delayed
> queue re-run. I think if scsi_queue_rq() returns BLK_STS_RESOURCE that the
> queue is only rerun after a delay. From blk_mq_dispatch_rq_list():
> 
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

The delayed re-run can't work, given that we call
blk_mq_get_dispatch_budget() before dequeuing a request from the
scheduler/sw queue, in order to improve IO merging. At that point no request
has been dispatched yet, so the queue re-run path isn't involved.
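
To illustrate the ordering, here is a toy userspace model in C. It is not
the kernel code: toy_lun, toy_stats, toy_get_budget() and toy_dispatch() are
made-up names, and the loop is only a simplified picture of taking budget
before dequeuing. The point it tries to show is that once the per-LUN limit
is hit, the loop stops with the remaining requests still in the scheduler,
and the driver's queue_rq path is never reached for them.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; every name here is hypothetical, not a kernel symbol. */
struct toy_lun {
	int busy;        /* commands in flight (think sdev->device_busy) */
	int queue_depth; /* per-LUN limit (think sdev->queue_depth) */
	int pending;     /* requests still sitting in the scheduler */
};

struct toy_stats {
	int dispatched;      /* requests handed to the driver */
	int budget_failures; /* times the budget check stopped the loop */
};

/* Take one unit of per-LUN budget, or fail if the LUN is already full. */
static bool toy_get_budget(struct toy_lun *lun)
{
	if (lun->busy >= lun->queue_depth)
		return false;
	lun->busy++;
	return true;
}

/* Dispatch loop: budget is checked first, the dequeue happens second. */
static struct toy_stats toy_dispatch(struct toy_lun *lun)
{
	struct toy_stats st = { 0, 0 };

	assert(lun->queue_depth > 0);

	while (lun->pending > 0) {
		if (!toy_get_budget(lun)) {
			/* Nothing was dequeued: the request stays in the
			 * scheduler and the driver is never called. */
			st.budget_failures++;
			break;
		}
		lun->pending--;  /* dequeue from the scheduler */
		st.dispatched++; /* hand to the driver, budget already held */
	}
	return st;
}
```

With queue_depth = 2 and five pending requests, this model dispatches two
requests, records one budget failure, and leaves three requests queued where
they can still be merged.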

Thanks,
Ming