On 11/17/22 20:59, Andreas Hindborg wrote:
>
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>
>> On Thu, Nov 17, 2022 at 10:07:14AM +0100, Andreas Hindborg wrote:
>>>
>>> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>>>
>>>> On Thu, Nov 17, 2022 at 09:05:48AM +0100, Andreas Hindborg wrote:
>>>>>
>>>>> Ming Lei <ming.lei@xxxxxxxxxx> writes:
>>>>>
>>>>>> Hi Andreas,
>>>>>>
>>>>>> On Wed, Nov 16, 2022 at 04:00:15PM +0100, Andreas Hindborg wrote:
>>>>>>>
>>>>>>> Hi Ming,
>>>>>>>
>>>>>>> I have a question regarding ublk. For context, I am working on adding
>>>>>>> zoned storage support to ublk, and you have been very kind to help me
>>>>>>> on GitHub [1].
>>>>>>>
>>>>>>> I have a problem with the ordering of requests after they are issued
>>>>>>> to the ublk driver. Basically, ublk will reverse the ordering of
>>>>>>> requests that are batched. The behavior can be observed with the
>>>>>>> following fio workload:
>>>>>>>
>>>>>>>> fio --name=test --ioengine=io_uring --rw=read --bs=4m --direct=1
>>>>>>>> --size=4m --filename=/dev/ublkb0
>>>>>>>
>>>>>>> For a loopback ublk target I get the following from blktrace:
>>>>>>>
>>>>>>>> 259,2  0  3469  286.337681303  724  D  R 0 + 1024 [fio]
>>>>>>>> 259,2  0  3470  286.337691313  724  D  R 1024 + 1024 [fio]
>>>>>>>> 259,2  0  3471  286.337694423  724  D  R 2048 + 1024 [fio]
>>>>>>>> 259,2  0  3472  286.337696583  724  D  R 3072 + 1024 [fio]
>>>>>>>> 259,2  0  3473  286.337698433  724  D  R 4096 + 1024 [fio]
>>>>>>>> 259,2  0  3474  286.337700213  724  D  R 5120 + 1024 [fio]
>>>>>>>> 259,2  0  3475  286.337702723  724  D  R 6144 + 1024 [fio]
>>>>>>>> 259,2  0  3476  286.337704323  724  D  R 7168 + 1024 [fio]
>>>>>>>> 259,1  0  1794  286.337794934  390  D  R 6144 + 2048 [ublk]
>>>>>>>> 259,1  0  1795  286.337805504  390  D  R 4096 + 2048 [ublk]
>>>>>>>> 259,1  0  1796  286.337816274  390  D  R 2048 + 2048 [ublk]
>>>>>>>> 259,1  0  1797  286.337821744  390  D  R 0 + 2048 [ublk]
>>>>>>>
>>>>>>> And enabling debug prints in ublk shows that the reversal happens when
>>>>>>> ublk defers work to the io_uring context:
>>>>>>>
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 180, sect 0
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 181, sect 1024
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 182, sect 2048
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 183, sect 3072
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 184, sect 4096
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 185, sect 5120
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 186, sect 6144
>>>>>>>> kernel: ublk_queue_rq: qid 0, tag 187, sect 7168
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 187 io_flags 1 addr 7f245d359000, sect 7168
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 186 io_flags 1 addr 7f245fcfd000, sect 6144
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 185 io_flags 1 addr 7f245fd7f000, sect 5120
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 184 io_flags 1 addr 7f245fe01000, sect 4096
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 183 io_flags 1 addr 7f245fe83000, sect 3072
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 182 io_flags 1 addr 7f245ff05000, sect 2048
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 181 io_flags 1 addr 7f245ff87000, sect 1024
>>>>>>>> kernel: __ublk_rq_task_work: complete: op 33, qid 0 tag 180 io_flags 1 addr 7f2460009000, sect 0
>>>>>>>
>>>>>>> The problem seems to be that the method used to defer work to the
>>>>>>> io_uring context, task_work_add(), is using a stack to queue the
>>>>>>> callbacks.
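For illustration, a minimal sketch (plain C with illustrative names, not the
actual kernel structures) of the effect described above: task_work_add()
pushes each callback onto the head of a singly linked list, so draining the
list runs the newest entry first and a batch comes out reversed.

#include <stdio.h>
#include <stddef.h>

struct work {
	struct work *next;
	int tag;		/* stand-in for the request tag */
};

static struct work *head;	/* the "stack": push to head, pop from head */

static void push(struct work *w)
{
	w->next = head;		/* a cmpxchg loop in the real llist; plain here */
	head = w;
}

int main(void)
{
	struct work w[4];
	struct work *cur;

	for (int i = 0; i < 4; i++) {	/* queue tags 0..3 in order */
		w[i].tag = i;
		push(&w[i]);
	}
	/* draining runs 3, 2, 1, 0 -- the reversal seen in the traces */
	for (cur = head; cur; cur = cur->next)
		printf("run tag %d\n", cur->tag);
	return 0;
}

Compiled standalone, this prints the tags as 3, 2, 1, 0.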
>>>>>>
>>>>>> Is the observation done on zoned ublk or plain ublk-loop?
>>>>>
>>>>> I collected this trace on my fork, but since the behavior comes from
>>>>> task_work.c using a stack to queue work, it would be present in
>>>>> upstream ublk as well. For completeness, I have verified that this is
>>>>> the case.
>>>>>
>>>>>> If it is plain ublk-loop, the reverse order can be started from
>>>>>> blk_add_rq_to_plug(), but it won't happen for zoned write requests,
>>>>>> which aren't queued to the plug list.
>>>>>
>>>>> I forgot to mention that I set the deadline scheduler for both ublkb0
>>>>> and the loopback target. No reordering should happen in mq with the
>>>>> deadline scheduler, as far as I understand.
>>>>
>>>> I meant you need to set up a zoned ublk-loop to observe write request
>>>> order; the block layer never guarantees request order for other devices.
>>>>
>>>>>
>>>>>>
>>>>>> Otherwise, can you observe zoned write request reordering from the
>>>>>> ublksrv side?
>>>>>>
>>>>>
>>>>> Yes, the reverse order is observable in the ublk server. The
>>>>> reordering happens in the ublk kernel driver.
>>>>>
>>>>>>>
>>>>>>> As you probably are aware, writes to zoned storage must
>>>>>>> happen at the write pointer, so when order is lost there is a
>>>>>>> problem. But I would assume that this behavior is also undesirable
>>>>>>> in other circumstances.
>>>>>>>
>>>>>>> I am not sure what is the best approach to change this behavior.
>>>>>>> Maybe have a separate queue for the requests and then only schedule
>>>>>>> work if a worker is not already processing the queue?
>>>>>>>
>>>>>>> If you like, I can try to implement a fix?
>>>>>>
>>>>>> Yeah, I think zoned write requests could be forwarded to the ublk
>>>>>> server out of order.
>>>>>
>>>>> In reverse order for requests that are issued without a context
>>>>> switch, as far as I can tell.
>>>>>
>>>>>>
>>>>>> Is it possible to re-order the writes on the ublksrv side? I guess it
>>>>>> is doable:
>>>>>>
>>>>>> 1) zoned write requests are always sent to ublk_queue_rq() in order
>>>>>>
>>>>>> 2) when the ublksrv zone target/backend code gets a zoned write
>>>>>> request for a zone, it can wait until the next expected write comes,
>>>>>> then handle the write and advance the write pointer, then repeat the
>>>>>> whole process.
>>>>>>
>>>>>> 3) because of 1), the next expected zoned write request is guaranteed
>>>>>> to come without much delay, and the per-zone queue won't be long.
>>>>>
>>>>> I think it is not feasible to have per-zone data structures. Instead, I
>>>>
>>>> If one mapping data structure is used for ordering per-zone write
>>>> requests, it could be pretty easy, such as a C++ map or a hash table,
>>>> and it won't take much memory given that the max number of in-flight
>>>> IOs is very limited and the zone mapping entries can be reused among
>>>> different zones, and it is quite easy to implement.
>>>>
>>>> Also, most of the time the per-zone ordering can be completed within
>>>> the current batch (requests completed from a single io_uring_enter()),
>>>> so the extra cost could be close to zero; you can simply run the
>>>> per-zone ordering in the ->handle_io_background() callback, by which
>>>> point all requests for each zone will usually have arrived.
>>>>
>>>>> considered adding a sequence number to each request, and then queueing
>>>>> up requests if there is a gap in the numbering.
>>>>
>>>> IMO, that won't be doable, given you don't know when the sequence will
>>>> be over.
>>>
>>> We would not need to know when the sequence is over. If in ubdsrv we see
>>> request numbers 1, 2, 4, we could hold 4 until 3 shows up. When 3 shows
>>> up, we go ahead and submit all requests from 3 and up, until there is
>>> another gap. If ublk_drv is setting the sequence numbers,
>>> cancelled/aborted requests would not be an issue.
>>>
>>> I think this would be less overhead than having a per-zone data
>>> structure.
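A hypothetical userspace sketch of this sequence-number scheme (illustrative
names, not ublksrv code; it assumes the driver stamps each request with a
queue-local sequence number, which upstream ublk does not currently do, and
that at most QD requests are in flight so slots never collide):

#include <stdint.h>
#include <stdio.h>

#define QD 64				/* assumed queue depth */

struct req {
	uint64_t seq;			/* driver-assigned sequence number */
	/* ... the IO payload would live here ... */
};

static struct req *pending[QD];		/* slot = seq % QD, NULL if absent */
static uint64_t next_seq;		/* next sequence we may submit */

static void submit_to_backend(struct req *r)
{
	printf("submit seq %llu\n", (unsigned long long)r->seq);
}

/*
 * Called per request as it arrives from the driver, possibly reordered.
 * With at most QD requests in flight, seq - next_seq < QD holds, so the
 * slot indexed by seq % QD is always free when a request arrives.
 */
static void on_request(struct req *r)
{
	pending[r->seq % QD] = r;
	/* release the contiguous run starting at next_seq, if present */
	while (pending[next_seq % QD]) {
		struct req *ready = pending[next_seq % QD];

		pending[next_seq % QD] = NULL;
		submit_to_backend(ready);
		next_seq++;
	}
}

int main(void)
{
	struct req r[4] = { {0}, {1}, {2}, {3} };

	/* arrival order 0, 1, 3, 2: request 3 is held until 2 shows up */
	on_request(&r[0]);
	on_request(&r[1]);
	on_request(&r[3]);
	on_request(&r[2]);
	return 0;
}

Holding at a gap and releasing the contiguous run is O(1) amortized per
request, which is the low overhead being argued for here.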
>>
>> You can only assign it to zoned write requests, but you still have to
>> check the sequence inside each zone, right? Then why not just check the
>> LBAs in each zone instead?
>
> We would need to know the zone map, which is not otherwise required.
> Then we would need to track the write pointer for each open zone for
> each queue, so that we can stall writes that are not issued at the write
> pointer. This is in effect all zones, because we cannot track when zones
> are implicitly closed. Then, if different queues are issuing writes to
> the same zone, we need to sync across queues. Userspace may have
> synchronization in place to issue writes with multiple threads while
> still hitting the write pointer.
>
> It is not good enough to only impose ordering within a zone if we have
> requests in flight for that zone. For the first write to a zone, when
> there are no writes in flight to that zone, we cannot know if the write
> is at the write pointer, or if more writes to lower LBAs are on the way.
> Not without tracking the write pointer.
>
> With a sequence number, the sequence number can be queue-local. It would
> not guarantee that writes always happen at the write pointer, but it
> would guarantee that requests are not reordered by ublk_drv, which is
> all that is required. As long as userspace is issuing at the write
> pointer (as it is required to for zoned storage), this solution would
> work, even across multiple queues issuing writes to the same zone.
>
>>
>>>
>>> But I still think it is an unnecessary hack. We could just fix the
>>> driver.
>>
>> But I am not sure it is easy to make io_uring_cmd_complete_in_task() or
>> task_work_add() complete requests in order efficiently.
>>
>>>
>>>>
>>>>>
>>>>> But really, the issue should be resolved by changing the way
>>>>> ublk_queue_rq() is sending work to the uring context with
>>>>> task_work_add().
>>>>
>>>> Not sure it can be solved easily, given that llist is implemented in
>>>> this way.
>>>
>>> If we queue requests on a separate queue, we are fine. I can make a
>>> patch suggestion.
>>
>> The problem is that io_uring_cmd_complete_in_task() or task_work_add()
>> may re-order the requests; can you explain how the issue can be solved
>> by a separate queue?
>
> I would suggest still using task_work_add() to queue a callback, but
> instead of carrying the pdu as the callback argument, use it as a
> notification that one or more work items are queued. Then we add the pdu
> to a hwctx-local FIFO queue.
>
> __ublk_rq_task_work() would then check this FIFO queue and submit all
> the requests on the queue to userspace.
>
> Without further optimization, __ublk_rq_task_work() would sometimes be
> called with an empty queue, but this should not be too much overhead.
> Maybe we could decide to not call task_work_add() if the hwctx-local
> queue of pdus is not empty.
>
>>
>>>
>>>>
>>>>> Fixing this in ublksrv is a bit of a hack and will have a performance
>>>>> penalty.
>>>>
>>>> As I mentioned, ordering zoned write requests within each zone just
>>>> takes a little memory (a mapping, or hashing), the max size of the
>>>> mapping table is the queue depth, and the CPU cost is basically zero,
>>>> so I don't see any extra performance penalty involved.
>>>
>>> I could implement all three solutions: 1) fix the driver, 2) per-zone
>>> structures, and 3) sequence numbers. Then I benchmark them and we will
>>> know what works. It's a lot of work though.
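For reference, a simplified single-threaded model of option 1, the
hwctx-local FIFO described above (a sketch, not a kernel patch; the real
driver would need an atomic llist or a lock, and the task_work_add() call is
reduced to a comment):

#include <stdio.h>
#include <stddef.h>

struct pdu {
	struct pdu *next;
	int tag;
};

/* per-hwctx FIFO: append at tail, drain from head */
static struct pdu *fifo_head;
static struct pdu **fifo_tail = &fifo_head;

static void queue_rq(struct pdu *p)	/* the ublk_queue_rq() role */
{
	int need_notify = (fifo_head == NULL);

	p->next = NULL;
	*fifo_tail = p;
	fifo_tail = &p->next;
	if (need_notify) {
		/* task_work_add(...) would go here -- just a wakeup,
		 * carrying no pdu, so its LIFO order is irrelevant */
	}
}

static void rq_task_work(void)		/* the __ublk_rq_task_work() role */
{
	struct pdu *p;

	while ((p = fifo_head)) {	/* drains in arrival order */
		fifo_head = p->next;
		if (!fifo_head)
			fifo_tail = &fifo_head;
		printf("forward tag %d to userspace\n", p->tag);
	}
}

int main(void)
{
	struct pdu p[4] = { { .tag = 0 }, { .tag = 1 },
			    { .tag = 2 }, { .tag = 3 } };

	for (int i = 0; i < 4; i++)
		queue_rq(&p[i]);
	rq_task_work();		/* prints tags 0..3, in submission order */
	return 0;
}

Because requests are appended at the tail and drained from the head, the
LIFO behavior of task_work no longer dictates the order in which requests
reach userspace.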
>>
>> Let's prove that the theory behind each solution is correct first.
>
> Alright.
>
>>
>>>
>>>>
>>>>>
>>>>> Also, it cannot be good for any storage device to have sequential
>>>>> requests delivered in reverse order? Fixing this would benefit all
>>>>> targets.
>>>>
>>>> Only the zoned target has a strict ordering requirement, which does
>>>> come at a cost; the block layer never guarantees request order. As I
>>>> mentioned, blk_add_rq_to_plug() can re-order requests in reverse order
>>>> too.
>>>
>>> When the mq-deadline scheduler is set for a queue, requests are not
>>> reordered, right?
>>
>> In the case of batched submission, mq-deadline will order the requests
>> in the whole batch by LBA.
>
> That is not what I am seeing in practice. I tried looking through
> mq-deadline.c, but I could not find code that would order by LBA. Could
> you point me to the code that implements this policy?

See the use of the rb tree, deadline_add_rq_rb() and friends. The rb tree
maintains the requests in LBA order. The fifo list maintains arrival order
for starvation control.

>
>>
>>>
>>> I am no block layer expert, so I cannot argue about the implementation.
>>> But I think that both spinning rust and flash would benefit from having
>>> sequential requests delivered in order? Would it not hurt performance
>>> to reverse the order of sequential requests all the time? I have a hard
>>> time understanding why the block layer would do this by default.
>>>
>>> One thing is to offer no guarantees, but to _always_ reverse the
>>> ordering of sequential requests seems a little counterproductive to me.
>>
>> If io_uring_cmd_complete_in_task()/task_work_add() can complete io
>> commands to ublksrv in order without extra cost, that is better.
>
> I agree :)
>
>>
>> But I don't think it is a big deal for HDD, because when these requests
>> are re-issued from ublksrv to the block layer, deadline or bfq will
>> order them too, since all are submitted via io_uring in a batch.
>
> As I wrote, that is not what I am seeing in my experiment. Repeating the
> output of blktrace from the top of this email:
>
>   259,2  0  3469  286.337681303  724  D  R 0 + 1024 [fio]
>   259,2  0  3470  286.337691313  724  D  R 1024 + 1024 [fio]
>   259,2  0  3471  286.337694423  724  D  R 2048 + 1024 [fio]
>   259,2  0  3472  286.337696583  724  D  R 3072 + 1024 [fio]
>   259,2  0  3473  286.337698433  724  D  R 4096 + 1024 [fio]
>   259,2  0  3474  286.337700213  724  D  R 5120 + 1024 [fio]
>   259,2  0  3475  286.337702723  724  D  R 6144 + 1024 [fio]
>   259,2  0  3476  286.337704323  724  D  R 7168 + 1024 [fio]
>   259,1  0  1794  286.337794934  390  D  R 6144 + 2048 [ublk]
>   259,1  0  1795  286.337805504  390  D  R 4096 + 2048 [ublk]
>   259,1  0  1796  286.337816274  390  D  R 2048 + 2048 [ublk]
>   259,1  0  1797  286.337821744  390  D  R 0 + 2048 [ublk]
>
> Here the read of 6144:
>
>   259,1  0  1794  286.337794934  390  D  R 6144 + 2048 [ublk]
>
> is clearly issued before the read of 4096:
>
>   259,1  0  1795  286.337805504  390  D  R 4096 + 2048 [ublk]
>
> It is not because there is no IO in flight for the target device.
> Here is the trace with completions included:
>
>   259,2  10   1  0.000000000  468  D  R 0 + 1024 [fio]
>   259,2  10   2  0.000005680  468  D  R 1024 + 1024 [fio]
>   259,2  10   3  0.000007760  468  D  R 2048 + 1024 [fio]
>   259,2  10   4  0.000009140  468  D  R 3072 + 1024 [fio]
>   259,2  10   5  0.000010420  468  D  R 4096 + 1024 [fio]
>   259,2  10   6  0.000011640  468  D  R 5120 + 1024 [fio]
>   259,2  10   7  0.000013350  468  D  R 6144 + 1024 [fio]
>   259,2  10   8  0.000014350  468  D  R 7168 + 1024 [fio]
>   259,1  10   1  0.000280540  412  D  R 6144 + 2048 [ublk]
>   259,1  10   2  0.000433260  412  D  R 4096 + 2048 [ublk]
>   259,1  10   3  0.000603920  412  D  R 2048 + 2048 [ublk]
>   259,1  10   4  0.000698070  412  D  R 0 + 2048 [ublk]
>   259,1  10   5  0.000839250    0  C  R 6144 + 2048 [0]
>   259,2  10   9  0.000891270  412  C  R 7168 + 1024 [0]
>   259,2  10  10  0.000919780  412  C  R 6144 + 1024 [0]
>   259,1  10   6  0.001258791    0  C  R 4096 + 2048 [0]
>   259,2  10  11  0.001306541  412  C  R 5120 + 1024 [0]
>   259,2  10  12  0.001335291  412  C  R 4096 + 1024 [0]
>   259,1  10   7  0.001469911    0  C  R 2048 + 2048 [0]
>   259,2  10  13  0.001518281  412  C  R 3072 + 1024 [0]
>   259,2  10  14  0.001547041  412  C  R 2048 + 1024 [0]
>   259,1  10   8  0.001600801    0  C  R 0 + 2048 [0]
>   259,2  10  15  0.001641871  412  C  R 1024 + 1024 [0]
>   259,2  10  16  0.001671921  412  C  R 0 + 1024 [0]
>
> This last trace is vanilla Linux 6.0 with mq-deadline enabled for
> ublkb0 (259,2) and the loopback target nvme0n1 (259,1).

Queue at head when it should be queue at tail?

If nvme0n1 is a multi-queue device and does not have a scheduler set, there
may be a lot of "issue directly", which can really destroy the order of
requests.

>
> Maybe I am missing some configuration?
>
> Best regards,
> Andreas

-- 
Damien Le Moal
Western Digital Research