On 4/16/24 6:40 AM, Pavel Begunkov wrote:
> On 4/16/24 13:24, Jens Axboe wrote:
>> On 4/16/24 6:14 AM, Pavel Begunkov wrote:
>>> On 4/16/24 12:38, Jens Axboe wrote:
>>>> On 4/16/24 4:00 AM, Ming Lei wrote:
>>>>> On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:
>>>>>>>>
>>>>>>>> I can't reproduce this here, fwiw. Ming, something you've seen?
>>>>>>>
>>>>>>> I just tested against the latest for-next/block (-rc4 based), and still
>>>>>>> can't reproduce it. There was such an RH internal report before, and it
>>>>>>> may not be ublk related.
>>>>>>>
>>>>>>> Changhui, if the issue can be reproduced on your machine, care to share
>>>>>>> your machine for me to investigate a bit?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ming
>>>>>>>
>>>>>>
>>>>>> I can still reproduce this issue on my machine, and I have shared the
>>>>>> machine with Ming, so he can investigate this issue further.
>>>>>>
>>>>>> [ 1244.207092] running generic/006
>>>>>> [ 1246.456896] blk_print_req_error: 77 callbacks suppressed
>>>>>> [ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
>>>>>> flags 0x8800 phys_seg 1 prio class 0
>>>>>
>>>>> The failure is actually triggered while recovering the qcow2 target in
>>>>> generic/005, since ublkb0 isn't removed successfully in generic/005.
>>>>>
>>>>> git-bisect shows that the first bad commit is cca6571381a0 ("io_uring/rw:
>>>>> cleanup retry path").
>>>>>
>>>>> I don't see any issue on the uring command side, so the trouble seems
>>>>> to be on the normal io_uring rw side over an XFS file, and not related
>>>>> to the block device.
>>>>
>>>> Indeed, I can reproduce it on XFS as well. I'll take a look.
>>>
>>> Looking at this patch, that io_rw_should_reissue() path is for when
>>> we failed via the kiocb callback but got there off of the submission
>>> path, so when we unwind back it finds the flag, preps, and resubmits
>>> the req. If that's not the case but we still return "true", it leaks
>>> the request, which would explain why exit_work gets stuck.
>>
>> Yep, this is what happens. I have a test patch that just punts any
>> reissue to task_work, which will insert it into io-wq from there.
>> Before, we would fail it even though we didn't have to, but that check
>> was killed, and now the request just lingers for this case and is lost.
>
> Sounds good, but let me note that while unwinding, block/fs/etc.
> could try to revert the iter, so it might not be safe to initiate
> async IO from the callback as is.

Good point, we may just want to do the iov_iter revert before sending it
to io-wq for retry. Seems prudent, and can't hurt.

-- 
Jens Axboe
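
[Editor's note: a rough sketch of the revert-then-punt idea discussed
above, not the actual patch. io_req_task_queue_reissue() and the
bytes_done bookkeeping are assumptions modeled on how earlier io_uring
reissue code has looked; only iov_iter_revert() is a known kernel API.]

/*
 * Sketch only: assumes kernel-internal io_uring context, i.e.
 * <linux/uio.h> for iov_iter and the io_uring-private io_kiocb.
 */
static void io_rw_reissue_async(struct io_kiocb *req, struct iov_iter *iter,
				size_t bytes_done)
{
	/*
	 * The lower layers (block/fs/etc.) may have advanced the iter
	 * during the failed submission, so restore it to its state at
	 * submission time before the request is handed off; an async
	 * retry must not see a partially consumed iter.
	 */
	iov_iter_revert(iter, bytes_done);

	/*
	 * Punt the reissue to task_work instead of resubmitting from
	 * the completion callback; the task_work handler can then
	 * safely insert the request into io-wq for the retry, and the
	 * request is never silently dropped on the unwind path.
	 */
	io_req_task_queue_reissue(req);
}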