Thanks for the quick response! When I remove that check on the 6.1.85 kernel version we're using, then it seems that the user space program is losing IOs. I confirmed this with fio. When we hit this issue, fio never completes and is stuck. I can certainly try that later kernel with your fix, if you think there are other changes that prevent losing IOs. -- Michael -----Original Message----- From: Jens Axboe <axboe@xxxxxxxxx> Sent: Tuesday, January 7, 2025 11:13 AM To: Haeuptle, Michael <michael.haeuptle@xxxxxxx>; io-uring@xxxxxxxxxxxxxxx Subject: Re: Bug? CQE.res = -EAGAIN with nvme multipath driver On 1/6/25 7:39 PM, Jens Axboe wrote: > On 1/6/25 7:33 PM, Jens Axboe wrote: >> On 1/6/25 4:53 PM, Jens Axboe wrote: >>> On 1/6/25 1:03 PM, Haeuptle, Michael wrote: >>>> Hello, >>>> >>>> I?m using the nvme multipath driver (NVMF/RDMA) and io-uring. When >>>> a path goes away, I sometimes get a CQE.res = -EAGAIN in user space. >>>> This is unexpected since the nvme multipath driver should handle >>>> this transparently. It?s somewhat workload related but easy to >>>> reproduce with fio. >>>> >>>> The multipath driver uses kblockd worker to re-queue the failed >>>> NVME bios >>>> (https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/drivers/nvme/host/multipath.c#L126). >>>> The original request is ended. >>>> >>>> When the nvme_requeue_work callback is executed, the blk layer >>>> tries to allocate a new request for the bios but that fails and the >>>> bio status is set to BLK_STS_AGAIN >>>> (https://elixir.bootlin.com/linux/v6.12.6/source/block/blk-mq.c#L2987). >>>> The failure to allocate a new req seems to be due to all tags for >>>> the queue being used up. >>>> >>>> Eventually, this makes it into io_uring?s io_rw_should_reissue and >>>> hits same_thread_group(req->tctx->task, current) = false (in >>>> https://github.com/torvalds/linux/blob/13563da6ffcf49b8b45772e40b35f96926a7ee1e/io_uring/rw.c#L437). >>>> As a result, CQE.res = -EAGAIN and thrown back to the user space >>>> program. >>>> >>>> Here?s a stack dump when we hit same_thread_group(req->tctx->task, >>>> current) = false >>>> >>>> kernel: [237700.098733] dump_stack_lvl+0x44/0x5c >>>> kernel: [237700.098737] io_rw_should_reissue.cold+0x5d/0x64 >>>> kernel: [237700.098742] io_complete_rw+0x9a/0xc0 >>>> kernel: [237700.098745] blkdev_bio_end_io_async+0x33/0x80 >>>> kernel: [237700.098749] blk_mq_submit_bio+0x5b5/0x620 >>>> kernel: [237700.098756] submit_bio_noacct_nocheck+0x163/0x370 >>>> kernel: [237700.098760] ? submit_bio_noacct+0x79/0x4b0 >>>> kernel: [237700.098764] nvme_requeue_work+0x4b/0x60 [nvme_core] >>>> kernel: [237700.098776] process_one_work+0x1c7/0x380 >>>> kernel: [237700.098782] worker_thread+0x4d/0x380 >>>> kernel: [237700.098786] ? _raw_spin_lock_irqsave+0x23/0x50 >>>> kernel: [237700.098791] ? rescuer_thread+0x3a0/0x3a0 >>>> kernel: [237700.098794] kthread+0xe9/0x110 >>>> kernel: [237700.098798] ? kthread_complete_and_exit+0x20/0x20 >>>> kernel: [237700.098802] ret_from_fork+0x22/0x30 >>>> kernel: [237700.098811] </TASK> >>>> >>>> Is the same_thread_group() check really needed in this case? The >>>> thread groups are certainly different? Any side effects if this >>>> check is being removed? >>> >>> It's their for safety reasons - across all request types, it's not >>> always safe. For this case, absolutely the check does not need to be >>> there. So probably best to ponder ways to bypass it selectively. >>> >>> Let me ponder a bit what the best approach would be here... >> >> Actually I think we can just remove it. The actual retry will happen >> out of context anyway, and the comment about the import is no longer >> valid as the import will have been done upfront since 6.10. >> >> Do you want to send a patch for that, or do you want me to send one >> out referencing this report? > > Also see: > > commit 039a2e800bcd5beb89909d1a488abf3d647642cf > Author: Jens Axboe <axboe@xxxxxxxxx> > Date: Thu Apr 25 09:04:32 2024 -0600 > > io_uring/rw: reinstate thread check for retries > > let me take a closer look tomorrow... If you can test a custom kernel, can you give this branch a try? git://git.kernel.dk/linux.git io_uring-rw-retry -- Jens Axboe