Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler

Damien Le Moal <dlemoal@xxxxxxxxxx> · Fri, 23 Jun 2023 08:45:28 +0900

On 6/21/23 09:34, Bart Van Assche wrote:
> On 5/24/23 16:06, Damien Le Moal wrote:
>> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
>> thus hiding potential bugs in the user write sequence for a zone. That is fine.
>> However, once a write request is dispatched, we should keep the assumption that
>> it is a well formed one, namely directed at the zone write pointer. So any
>> consideration of requeue solving write ordering issues is moot to me.
>>
>> Furthermore, when the requeue happens, the target zone is still locked and the
>> only write request that can be in flight for that target zones is that one being
>> requeued. Add to that the above assumption that the request is the one we must
>> dispatch first, there are absolutely zero chances of seeing a reordering happen
>> for writes to a particular zone. I really do not see the point of requeuing that
>> request through the IO scheduler at all.
> 
> I will drop the changes from this patch that send requeued dispatched 
> requests to the I/O scheduler instead of back to the dispatch list. It 
> took me considerable effort to find all the potential causes of 
> reordering. As you may have noticed I also came up with some changes 
> that are not essential. I should have realized myself that this change 
> is not essential.
> 
>> As long as we keep zone write locking for zoned devices, requeue to the head of
>> the dispatch queue is fine. But maybe this work is preparatory to removing zone
>> write locking ? If that is the case, I would like to see that as well to get the
>> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
>> blocker for this change.
> 
> Regarding removing zone write locking, would it be acceptable to 
> implement a solution for SCSI devices before it is clear how to 
> implement a solution for NVMe devices? I think a potential solution for 
> SCSI devices is to send requests that should be requeued to the SCSI 
> error handler instead of to the block layer requeue list. The SCSI error 
> handler waits until all pending requests have timed out or have been 
> sent to the error handler. The SCSI error handler can be modified such 
> that requests are sorted in LBA order before being resubmitted. This 
> would solve the nasty issues that would otherwise arise when requeuing 
> requests if multiple write requests for the same zone are pending.

I am still thinking that a dedicated hctx for writes to sequential zones may be
the simplest solution for all device types:
1) For scsi HBAs, we can likely gain high qd zone writes, but that needs to be
checked. For AHCI though, we need to keep the max write qd=1 per zone because of
the chipsets reordering command submissions. So we'll need a queue flag saying
"need zone write locking" indicated by the adapter when creating the queue.
2) For NVMe, this would allow high QD writes, with only the penalty of heavier
locking overhead when writes are issued from multiple CPUs.

But I have not started looking at all the details. Need to start prototyping
something. We can try working on this together if you want.

-- 
Damien Le Moal
Western Digital Research