Re: [PATCH v3 2/7] block: Send requeued requests to the I/O scheduler

Bart Van Assche <bvanassche@xxxxxxx> · Fri, 23 Jun 2023 13:31:58 -0700

On 6/22/23 16:45, Damien Le Moal wrote:
> On 6/21/23 09:34, Bart Van Assche wrote:
>> Regarding removing zone write locking, would it be acceptable to
>> implement a solution for SCSI devices before it is clear how to
>> implement a solution for NVMe devices? I think a potential solution for
>> SCSI devices is to send requests that should be requeued to the SCSI
>> error handler instead of to the block layer requeue list. The SCSI error
>> handler waits until all pending requests have timed out or have been
>> sent to the error handler. The SCSI error handler can be modified such
>> that requests are sorted in LBA order before being resubmitted. This
>> would solve the nasty issues that would otherwise arise when requeuing
>> requests if multiple write requests for the same zone are pending.
>
> I am still thinking that a dedicated hctx for writes to sequential zones may be
> the simplest solution for all device types:
> 1) For scsi HBAs, we can likely gain high qd zone writes, but that needs to be
> checked. For AHCI though, we need to keep the max write qd=1 per zone because of
> the chipsets reordering command submissions. So we'll need a queue flag saying
> "need zone write locking" indicated by the adapter when creating the queue.
> 2) For NVMe, this would allow high QD writes, with only the penalty of heavier
> locking overhead when writes are issued from multiple CPUs.
>
> But I have not started looking at all the details. Need to start prototyping
> something. We can try working on this together if you want.

Hi Damien,

I'm interested in collaborating on this. But I'm not sure whether a 
dedicated hardware queue for sequential writes is a full solution. 
Applications must submit zoned writes (other than write appends) in 
order. These zoned writes may end up in a software queue. It is possible 
that the software queues are flushed in such a way that the zoned writes 
are reordered. Or do you perhaps want to send all zoned writes directly 
to a hardware queue? If so, is this really a better solution than a 
single-queue I/O scheduler? Is the difference perhaps that higher read 
IOPS can be achieved because multiple hardware queues are used for reads?
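
To make the reordering concern concrete, here is a toy Python model (not
kernel code). It assumes two per-CPU software queues and a flush that
drains one queue completely before moving to the next; `submit` and
`flush_all` are hypothetical names used only for this illustration.

```python
from collections import deque

# Toy model: one software queue per CPU. This is an assumption made for
# illustration, not the actual blk-mq data structures.
def submit(sw_queues, cpu, lba):
    sw_queues[cpu].append(lba)

def flush_all(sw_queues):
    """Drain each software queue in turn (one possible flush order)."""
    dispatched = []
    for q in sw_queues:
        while q:
            dispatched.append(q.popleft())
    return dispatched

# The application writes LBAs 0..3 of one zone in order, but the
# submitting thread alternates between CPU 0 and CPU 1.
sw_queues = [deque(), deque()]
for cpu, lba in [(0, 0), (1, 1), (0, 2), (1, 3)]:
    submit(sw_queues, cpu, lba)

order = flush_all(sw_queues)
print(order)  # [0, 2, 1, 3]
```

Although the writes were submitted in LBA order, they reach the device
out of order in this model.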

Even if all sequential writes were sent to a single hardware queue, 
supporting queue depths > 1 still requires a mechanism for resubmitting 
requests in order after a request has been requeued. If e.g. three zoned 
writes are in flight and a unit attention is reported for the second 
write, then the two writes that have to be resubmitted (the second and 
the third) must only be resubmitted after both have completed.
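
A minimal sketch of such a mechanism, again as a toy Python model rather
than kernel code. The class `ZoneResubmitter` is hypothetical. The sketch
assumes that the write following the failed one also fails, and that
completions may arrive in any order.

```python
class ZoneResubmitter:
    """Hold back failed zoned writes until nothing is in flight anymore."""

    def __init__(self, in_flight_lbas):
        self.pending = set(in_flight_lbas)  # writes not yet completed
        self.failed = []                    # writes that must be resubmitted

    def complete(self, lba, ok):
        """Record one completion.

        Returns the LBAs to resubmit, sorted by LBA, once every in-flight
        write has completed; returns [] while writes are still pending.
        """
        self.pending.remove(lba)
        if not ok:
            self.failed.append(lba)
        if self.pending:
            return []
        batch, self.failed = sorted(self.failed), []
        return batch

# Three writes in flight. The second (LBA 8) reports a unit attention and
# the third (LBA 16) is assumed to fail as a consequence.
r = ZoneResubmitter([0, 8, 16])
first = r.complete(16, ok=False)   # still waiting
second = r.complete(0, ok=True)    # still waiting
third = r.complete(8, ok=False)    # everything completed
print(first, second, third)  # [] [] [8, 16]
```

Nothing is resubmitted until the last completion arrives, and the batch
is then issued in LBA order.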

Another possibility is to introduce a new request queue flag that 
specifies that only writes should be sent to the I/O scheduler. I'm 
interested in this because of the following observation for zoned UFS 
devices for a block size of 4 KiB and a random read workload:
* mq-deadline scheduler:  59 K IOPS.
* no I/O scheduler:      100 K IOPS.
In other words, about 70% more IOPS with no I/O scheduler compared to 
mq-deadline. I don't think that this indicates a performance bug in the 
mq-deadline scheduler. From a quick measurement with the null_blk driver 
it seems to me that all I/O schedulers saturate around 150 K - 170 K IOPS.
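
For reference, the relative gain can be recomputed from the two numbers
above. The variable names are only for this illustration.

```python
mq_deadline_kiops = 59   # measured with the mq-deadline scheduler
none_kiops = 100         # measured with no I/O scheduler

# Relative increase of "no scheduler" over mq-deadline, in percent.
gain_pct = (none_kiops / mq_deadline_kiops - 1) * 100
print(f"{gain_pct:.1f}% more IOPS")  # 69.5% more IOPS
```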

Thanks,

Bart.