Re: [bug report] WARNING: CPU: 5 PID: 679 at io_uring/io_uring.c:2835 io_ring_exit_work+0x2b6/0x2e0

Pavel Begunkov <asml.silence@xxxxxxxxx> · Tue, 16 Apr 2024 13:40:45 +0100

On 4/16/24 13:24, Jens Axboe wrote:
On 4/16/24 6:14 AM, Pavel Begunkov wrote:
On 4/16/24 12:38, Jens Axboe wrote:
On 4/16/24 4:00 AM, Ming Lei wrote:
On Tue, Apr 16, 2024 at 10:26:16AM +0800, Changhui Zhong wrote:

I can't reproduce this here, fwiw. Ming, something you've seen?

I just test against the latest for-next/block(-rc4 based), and still can't
reproduce it. There was such RH internal report before, and maybe not
ublk related.

Changhui, if the issue can be reproduced in your machine, care to share
your machine for me to investigate a bit?

Thanks,
Ming

I still can reproduce this issue on my machine?
and I shared machine to Ming?he can do more investigation for this issue?

[ 1244.207092] running generic/006
[ 1246.456896] blk_print_req_error: 77 callbacks suppressed
[ 1246.456907] I/O error, dev ublkb1, sector 2395864 op 0x1:(WRITE)
flags 0x8800 phys_seg 1 prio class 0

The failure is actually triggered in recovering qcow2 target in generic/005,
since ublkb0 isn't removed successfully in generic/005.

git-bisect shows that the 1st bad commit is cca6571381a0 ("io_uring/rw:
cleanup retry path").

And not see any issue in uring command side, so the trouble seems
in normal io_uring rw side over XFS file, and not related with block
device.

Indeed, I can reproduce it on XFS as well. I'll take a look.

Looking at this patch, that io_rw_should_reissue() path is for when
we failed via the kiocb callback but came there off of the submission
path, so when we unwind back it finds the flag, preps and resubmits
the req. If it's not the case but we still return "true", it'd leaks
the request, which would explains why exit_work gets stuck.

Yep, this is what happens. I have a test patch that just punts any
reissue to task_work, it'll insert to iowq from there. Before we would
fail it, even though we didn't have to, but that check was killed and
then it just lingers for this case and it's lost.

Sounds good, but let me note that while unwinding, block/fs/etc
could try to revert the iter, so it might not be safe to initiate
async IO from the callback as is

--
Pavel Begunkov