On 11/17/22 20:59, Andreas Hindborg wrote:
>
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
>> On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote:
>>>
>>> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>>>
>>>> On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote:
>>>>>
>>>>> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>>>>>
>>>>>> Hi Andreas,
>>>>>>
>>>>>> On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote:
>>>>>>>
>>>>>>> Hi Ming,
>>>>>>>
>>>>>>> I have a question regarding ublk. For context, I am working on adding
>>>>>>> zoned storage support to ublk, and you have been very kind to help me
>>>>>>> on GitHub [1].
>>>>>>>
>>>>>>> I have a problem with the ordering of requests after they are issued
>>>>>>> to the ublk driver. Basically, ublk will reverse the ordering of
>>>>>>> requests that are batched. The behavior can be observed with the
>>>>>>> following fio workload:
>>>>>>>
>>>>>>>> fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1
>>>>>>>> --size=4m --filename=/dev/ublkb0
>>>>>>>
>>>>>>> For a loopback ublk target I get the following from blktrace:
>>>>>>>
>>>>>>>> 259,2  0  3469  286.337681303  724  D  R 0 + 1024 [fio]
>>>>>>>> 259,2  0  3470  286.337691313  724  D  R 1024 + 1024 [fio]
>>>>>>>> 259,2  0  3471  286.337694423  724  D  R 2048 + 1024 [fio]
>>>>>>>> 259,2  0  3472  286.337696583  724  D  R 3072 + 1024 [fio]
>>>>>>>> 259,2  0  3473  286.337698433  724  D  R 4096 + 1024 [fio]
>>>>>>>> 259,2  0  3474  286.337700213  724  D  R 5120 + 1024 [fio]
>>>>>>>> 259,2  0  3475  286.337702723  724  D  R 6144 + 1024 [fio]
>>>>>>>> 259,2  0  3476  286.337704323  724  D  R 7168 + 1024 [fio]
>>>>>>>> 259,1  0  1794  286.337794934  390  D  R 6144 + 2048 [ublk]
>>>>>>>> 259,1  0  1795  286.337805504  390  D  R 4096 + 2048 [ublk]
>>>>>>>> 259,1  0  1796  286.337816274  390  D  R 2048 + 2048 [ublk]
>>>>>>>> 259,1  0  1797  286.337821744  390  D  R 0 + 2048 [ublk]
>>>>>>>
>>>>>>> And enabling debug prints in ublk shows that the reversal happens when
>>>>>>> ublk defers work to the io_uring context:
>>>>>>>
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 180, sect 0
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 181, sect 1024
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 182, sect 2048
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 183, sect 3072
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 184, sect 4096
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 185, sect 5120
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 186, sect 6144
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 187, sect 7168
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0
>>>>>>>
>>>>>>> The problem seems to be that the method used to defer work to the
>>>>>>> io_uring context, task_work_add(), is using a stack to queue the
>>>>>>> callbacks.
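For illustration, a minimal sketch (plain C with illustrative names, not the
actual kernel structures) of the effect described above: task_work_add()
pushes each callback onto the head of a singly linked list, so draining the
list runs the newest entry first and a batch comes out reversed.

#include <stdio.h>
#include <stddef.h>

struct work {
	struct work *next;
	int tag;		/* stand-in for the request tag */
};

static struct work *head;	/* the "stack": push to head, pop from head */

static void push(struct work *w)
{
	w->next = head;		/* a cmpxchg loop in the real llist; plain here */
	head = w;
}

int main(void)
{
	struct work w[4];
	struct work *cur;

	for (int i = 0; i < 4; i++) {	/* queue tags 0..3 in order */
		w[i].tag = i;
		push(&w[i]);
	}
	/* draining runs 3, 2, 1, 0 -- the reversal seen in the traces */
	for (cur = head; cur; cur = cur->next)
		printf("run tag %d\n", cur->tag);
	return 0;
}

Compiled standalone, this prints the tags as 3, 2, 1, 0.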
>>>>>>
>>>>>> Is the observation done on zoned ublk or plain ublk-loop?
>>>>>
>>>>> I collected this trace on my fork, but since the behavior comes from
>>>>> task_work.c using a stack to queue work, it would be present in
>>>>> upstream ublk as well. For completeness, I have verified that this is
>>>>> the case.
>>>>>
>>>>>> If it is plain ublk-loop, the reverse order can be started from
>>>>>> blk_add_rq_to_plug(), but it won't happen for zoned write requests,
>>>>>> which aren't queued to the plug list.
>>>>>
>>>>> I forgot to mention that I set the deadline scheduler for both ublkb0
>>>>> and the loopback target. No reordering should happen in mq with the
>>>>> deadline scheduler, as far as I understand.
>>>>
>>>> I meant you need to set up a zoned ublk-loop to observe write request
>>>> order; the block layer never guarantees request order for other devices.
>>>>
>>>>>
>>>>>>
>>>>>> Otherwise, can you observe zoned write request reordering from the
>>>>>> ublksrv side?
>>>>>>
>>>>>
>>>>> Yes, the reverse order is observable in the ublk server. The
>>>>> reordering happens in the ublk kernel driver.
>>>>>
>>>>>>>
>>>>>>> As you probably are aware, writes to zoned storage must
>>>>>>> happen at the write pointer, so when order is lost there is a
>>>>>>> problem. But I would assume that this behavior is also undesirable
>>>>>>> in other circumstances.
>>>>>>>
>>>>>>> I am not sure what is the best approach to change this behavior.
>>>>>>> Maybe have a separate queue for the requests and then only schedule
>>>>>>> work if a worker is not already processing the queue?
>>>>>>>
>>>>>>> If you like, I can try to implement a fix?
>>>>>>
>>>>>> Yeah, I think zoned write requests could be forwarded to the ublk
>>>>>> server out of order.
>>>>>
>>>>> In reverse order for requests that are issued without a context
>>>>> switch, as far as I can tell.
>>>>>
>>>>>>
>>>>>> Is it possible to re-order the writes on the ublksrv side? I guess it
>>>>>> is doable:
>>>>>>
>>>>>> 1) zoned write requests are always sent to ublk_queue_rq() in order
>>>>>>
>>>>>> 2) when the ublksrv zone target/backend code gets a zoned write
>>>>>> request for a zone, it can wait until the next expected write comes,
>>>>>> then handle the write and advance the write pointer, then repeat the
>>>>>> whole process.
>>>>>>
>>>>>> 3) because of 1), the next expected zoned write request is guaranteed
>>>>>> to come without much delay, and the per-zone queue won't be long.
>>>>>
>>>>> I think it is not feasible to have per-zone data structures. Instead, I
>>>>
>>>> If one mapping data structure is used for ordering per-zone write
>>>> requests, it could be pretty easy, such as a C++ map or a hash table,
>>>> and it won't take much memory given that the max number of in-flight
>>>> IOs is very limited and the zone mapping entries can be reused among
>>>> different zones, and it is quite easy to implement.
>>>>
>>>> Also, most of the time the per-zone ordering can be completed within
>>>> the current batch (requests completed from a single io_uring_enter()),
>>>> so the extra cost could be close to zero; you can simply run the
>>>> per-zone ordering in the ->handle_io_background() callback, by which
>>>> point all requests for each zone will usually have arrived.
>>>>
>>>>> considered adding a sequence number to each request, and then queueing
>>>>> up requests if there is a gap in the numbering.
>>>>
>>>> IMO, that won't be doable, given you don't know when the sequence will
>>>> be over.
>>>
>>> We would not need to know when the sequence is over. If in ubdsrv we see
>>> request numbers 1, 2, 4, we could hold 4 until 3 shows up. When 3 shows
>>> up, we go ahead and submit all requests from 3 and up, until there is
>>> another gap. If ublk_drv is setting the sequence numbers,
>>> cancelled/aborted requests would not be an issue.
>>>
>>> I think this would be less overhead than having a per-zone data
>>> structure.
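A hypothetical userspace sketch of this sequence-number scheme (illustrative
names, not ublksrv code; it assumes the driver stamps each request with a
queue-local sequence number, which upstream ublk does not currently do, and
that at most QD requests are in flight so slots never collide):

#include <stdint.h>
#include <stdio.h>

#define QD 64				/* assumed queue depth */

struct req {
	uint64_t seq;			/* driver-assigned sequence number */
	/* ... the IO payload would live here ... */
};

static struct req *pending[QD];		/* slot = seq % QD, NULL if absent */
static uint64_t next_seq;		/* next sequence we may submit */

static void submit_to_backend(struct req *r)
{
	printf("submit seq %llu\n", (unsigned long long)r->seq);
}

/*
 * Called per request as it arrives from the driver, possibly reordered.
 * With at most QD requests in flight, seq - next_seq < QD holds, so the
 * slot indexed by seq % QD is always free when a request arrives.
 */
static void on_request(struct req *r)
{
	pending[r->seq % QD] = r;
	/* release the contiguous run starting at next_seq, if present */
	while (pending[next_seq % QD]) {
		struct req *ready = pending[next_seq % QD];

		pending[next_seq % QD] = NULL;
		submit_to_backend(ready);
		next_seq++;
	}
}

int main(void)
{
	struct req r[4] = { {0}, {1}, {2}, {3} };

	/* arrival order 0, 1, 3, 2: request 3 is held until 2 shows up */
	on_request(&r[0]);
	on_request(&r[1]);
	on_request(&r[3]);
	on_request(&r[2]);
	return 0;
}

Holding at a gap and releasing the contiguous run is O(1) amortized per
request, which is the low overhead being argued for here.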
>>
>> You can only assign it to zoned write requests, but you still have to
>> check the sequence inside each zone, right? Then why not just check the
>> LBAs in each zone instead?
>
> We would need to know the zone map, which is not otherwise required.
> Then we would need to track the write pointer for each open zone for
> each queue, so that we can stall writes that are not issued at the write
> pointer. This is in effect all zones, because we cannot track when zones
> are implicitly closed. Then, if different queues are issuing writes to
> the same zone, we need to sync across queues. Userspace may have
> synchronization in place to issue writes with multiple threads while
> still hitting the write pointer.
>
> It is not good enough to only impose ordering within a zone if we have
> requests in flight for that zone. For the first write to a zone, when
> there are no writes in flight to that zone, we cannot know if the write
> is at the write pointer, or if more writes to lower LBAs are on the way.
> Not without tracking the write pointer.
>
> With a sequence number, the sequence number can be queue-local. It would
> not guarantee that writes always happen at the write pointer, but it
> would guarantee that requests are not reordered by ublk_drv, which is
> all that is required. As long as userspace is issuing at the write
> pointer (as it is required to for zoned storage), this solution would
> work, even across multiple queues issuing writes to the same zone.
>
>>
>>>
>>> But I still think it is an unnecessary hack. We could just fix the
>>> driver.
>>
>> But I am not sure it is easy to make io_uring_cmd_complete_in_task() or
>> task_work_add() complete requests in order efficiently.
>>
>>>
>>>>
>>>>>
>>>>> But really, the issue should be resolved by changing the way
>>>>> ublk_queue_rq() is sending work to the uring context with
>>>>> task_work_add().
>>>>
>>>> Not sure it can be solved easily, given that llist is implemented in
>>>> this way.
>>>
>>> If we queue requests on a separate queue, we are fine. I can make a
>>> patch suggestion.
>>
>> The problem is that io_uring_cmd_complete_in_task() or task_work_add()
>> may re-order the requests; can you explain how the issue can be solved
>> by a separate queue?
>
> I would suggest still using task_work_add() to queue a callback, but
> instead of carrying the pdu as the callback argument, use it as a
> notification that one or more work items are queued. Then we add the pdu
> to a hwctx-local FIFO queue.
>
> __ublk_rq_task_work() would then check this FIFO queue and submit all
> the requests on the queue to userspace.
>
> Without further optimization, __ublk_rq_task_work() would sometimes be
> called with an empty queue, but this should not be too much overhead.
> Maybe we could decide to not call task_work_add() if the hwctx-local
> queue of pdus is not empty.
>
>>
>>>
>>>>
>>>>> Fixing this in ublksrv is a bit of a hack and will have a performance
>>>>> penalty.
>>>>
>>>> As I mentioned, ordering zoned write requests within each zone just
>>>> takes a little memory (a mapping, or hashing), the max size of the
>>>> mapping table is the queue depth, and the CPU cost is basically zero,
>>>> so I don't see any extra performance penalty involved.
>>>
>>> I could implement all three solutions: 1) fix the driver, 2) per-zone
>>> structures, and 3) sequence numbers. Then I benchmark them and we will
>>> know what works. It's a lot of work though.
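For reference, a simplified single-threaded model of option 1, the
hwctx-local FIFO described above (a sketch, not a kernel patch; the real
driver would need an atomic llist or a lock, and the task_work_add() call is
reduced to a comment):

#include <stdio.h>
#include <stddef.h>

struct pdu {
	struct pdu *next;
	int tag;
};

/* per-hwctx FIFO: append at tail, drain from head */
static struct pdu *fifo_head;
static struct pdu **fifo_tail = &fifo_head;

static void queue_rq(struct pdu *p)	/* the ublk_queue_rq() role */
{
	int need_notify = (fifo_head == NULL);

	p->next = NULL;
	*fifo_tail = p;
	fifo_tail = &p->next;
	if (need_notify) {
		/* task_work_add(...) would go here -- just a wakeup,
		 * carrying no pdu, so its LIFO order is irrelevant */
	}
}

static void rq_task_work(void)		/* the __ublk_rq_task_work() role */
{
	struct pdu *p;

	while ((p = fifo_head)) {	/* drains in arrival order */
		fifo_head = p->next;
		if (!fifo_head)
			fifo_tail = &fifo_head;
		printf("forward tag %d to userspace\n", p->tag);
	}
}

int main(void)
{
	struct pdu p[4] = { { .tag = 0 }, { .tag = 1 },
			    { .tag = 2 }, { .tag = 3 } };

	for (int i = 0; i < 4; i++)
		queue_rq(&p[i]);
	rq_task_work();		/* prints tags 0..3, in submission order */
	return 0;
}

Because requests are appended at the tail and drained from the head, the
LIFO behavior of task_work no longer dictates the order in which requests
reach userspace.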
>>
>> Let's prove that the theory behind each solution is correct first.
>
> Alright.
>
>>
>>>
>>>>
>>>>>
>>>>> Also, it cannot be good for any storage device to have sequential
>>>>> requests delivered in reverse order? Fixing this would benefit all
>>>>> targets.
>>>>
>>>> Only the zoned target has a strict ordering requirement, which does
>>>> come at a cost; the block layer never guarantees request order. As I
>>>> mentioned, blk_add_rq_to_plug() can re-order requests in reverse order
>>>> too.
>>>
>>> When the mq-deadline scheduler is set for a queue, requests are not
>>> reordered, right?
>>
>> In the case of batched submission, mq-deadline will order the requests
>> in the whole batch by LBA.
>
> That is not what I am seeing in practice. I tried looking through
> mq-deadline.c, but I could not find code that would order by LBA. Could
> you point me to the code that implements this policy?

See the use of the rb tree, deadline_add_rq_rb() and friends. The rb tree
maintains the requests in LBA order. The fifo list maintains arrival order
for starvation control.

>
>>
>>>
>>> I am no block layer expert, so I cannot argue about the implementation.
>>> But I think that both spinning rust and flash would benefit from having
>>> sequential requests delivered in order? Would it not hurt performance
>>> to reverse the order of sequential requests all the time? I have a hard
>>> time understanding why the block layer would do this by default.
>>>
>>> One thing is to offer no guarantees, but to _always_ reverse the
>>> ordering of sequential requests seems a little counterproductive to me.
>>
>> If io_uring_cmd_complete_in_task()/task_work_add() can complete io
>> commands to ublksrv in order without extra cost, that is better.
>
> I agree :)
>
>>
>> But I don't think it is a big deal for HDD, because when these requests
>> are re-issued from ublksrv to the block layer, deadline or bfq will
>> order them too, since all are submitted via io_uring in a batch.
>
> As I wrote, that is not what I am seeing in my experiment. Repeating the
> output of blktrace from the top of this email:
>
>   259,2  0  3469  286.337681303  724  D  R 0 + 1024 [fio]
>   259,2  0  3470  286.337691313  724  D  R 1024 + 1024 [fio]
>   259,2  0  3471  286.337694423  724  D  R 2048 + 1024 [fio]
>   259,2  0  3472  286.337696583  724  D  R 3072 + 1024 [fio]
>   259,2  0  3473  286.337698433  724  D  R 4096 + 1024 [fio]
>   259,2  0  3474  286.337700213  724  D  R 5120 + 1024 [fio]
>   259,2  0  3475  286.337702723  724  D  R 6144 + 1024 [fio]
>   259,2  0  3476  286.337704323  724  D  R 7168 + 1024 [fio]
>   259,1  0  1794  286.337794934  390  D  R 6144 + 2048 [ublk]
>   259,1  0  1795  286.337805504  390  D  R 4096 + 2048 [ublk]
>   259,1  0  1796  286.337816274  390  D  R 2048 + 2048 [ublk]
>   259,1  0  1797  286.337821744  390  D  R 0 + 2048 [ublk]
>
> Here the read of 6144:
>
>   259,1  0  1794  286.337794934  390  D  R 6144 + 2048 [ublk]
>
> is clearly issued before the read of 4096:
>
>   259,1  0  1795  286.337805504  390  D  R 4096 + 2048 [ublk]
>
> It is not because there is no IO in flight for the target device.
> Here is the trace with completions included:
>
>   259,2  10   1  0.000000000  468  D  R 0 + 1024 [fio]
>   259,2  10   2  0.000005680  468  D  R 1024 + 1024 [fio]
>   259,2  10   3  0.000007760  468  D  R 2048 + 1024 [fio]
>   259,2  10   4  0.000009140  468  D  R 3072 + 1024 [fio]
>   259,2  10   5  0.000010420  468  D  R 4096 + 1024 [fio]
>   259,2  10   6  0.000011640  468  D  R 5120 + 1024 [fio]
>   259,2  10   7  0.000013350  468  D  R 6144 + 1024 [fio]
>   259,2  10   8  0.000014350  468  D  R 7168 + 1024 [fio]
>   259,1  10   1  0.000280540  412  D  R 6144 + 2048 [ublk]
>   259,1  10   2  0.000433260  412  D  R 4096 + 2048 [ublk]
>   259,1  10   3  0.000603920  412  D  R 2048 + 2048 [ublk]
>   259,1  10   4  0.000698070  412  D  R 0 + 2048 [ublk]
>   259,1  10   5  0.000839250    0  C  R 6144 + 2048 [0]
>   259,2  10   9  0.000891270  412  C  R 7168 + 1024 [0]
>   259,2  10  10  0.000919780  412  C  R 6144 + 1024 [0]
>   259,1  10   6  0.001258791    0  C  R 4096 + 2048 [0]
>   259,2  10  11  0.001306541  412  C  R 5120 + 1024 [0]
>   259,2  10  12  0.001335291  412  C  R 4096 + 1024 [0]
>   259,1  10   7  0.001469911    0  C  R 2048 + 2048 [0]
>   259,2  10  13  0.001518281  412  C  R 3072 + 1024 [0]
>   259,2  10  14  0.001547041  412  C  R 2048 + 1024 [0]
>   259,1  10   8  0.001600801    0  C  R 0 + 2048 [0]
>   259,2  10  15  0.001641871  412  C  R 1024 + 1024 [0]
>   259,2  10  16  0.001671921  412  C  R 0 + 1024 [0]
>
> This last trace is vanilla Linux 6.0 with mq-deadline enabled for
> ublkb0 (259,2) and the loopback target nvme0n1 (259,1).

Queue at head when it should be queue at tail?

If nvme0n1 is a multi-queue device and does not have a scheduler set, there
may be a lot of "issue directly", which can really destroy the order of
requests.

>
> Maybe I am missing some configuration?
>
> Best regards,
> Andreas

-- 
Damien Le Moal
Western Digital Research