On 3/17/24 6:15 PM, Ming Lei wrote:
> On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
>> On 3/17/24 4:07 PM, Jens Axboe wrote:
>>> On 3/17/24 3:51 PM, Jens Axboe wrote:
>>>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
>>>>> On 3/17/24 21:34, Pavel Begunkov wrote:
>>>>>> On 3/17/24 21:32, Jens Axboe wrote:
>>>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>>> The following two errors can be triggered with this patchset
>>>>>>>>>>>>> when running some ublk stress test (io vs. deletion). I don't
>>>>>>>>>>>>> see such failures after reverting the 11 patches.
>>>>>>>>>>>>
>>>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>>>>>> reproduce it, blktests?
>>>>>>>>>>>
>>>>>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>>>>>
>>>>>>>>>>> You may need to run this test multiple times to trigger the problem:
>>>>>>>>>>
>>>>>>>>>> Thanks for all the testing. I've tried it; all ublk/generic tests hang
>>>>>>>>>> in userspace waiting for CQEs, but there are no complaints from the
>>>>>>>>>> kernel. However, it seems the branch is buggy even without my patches:
>>>>>>>>>> I consistently (5-15 minutes of running in a slow VM) hit a page
>>>>>>>>>> underflow by running the liburing tests. Not sure what that is yet,
>>>>>>>>>> but it might also be the reason.
>>>>>>>>>
>>>>>>>>> Hmm, odd, there's nothing in there but your series and then the
>>>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>>>>>> haven't seen anything odd.
>>>>>>>>
>>>>>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>>>>>>> with the issue, and from the full recompilation and config prompts I
>>>>>>>> assumed you had pulled something in between (maybe not).
>>>>>>>>
>>>>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the
>>>>>>>> stack trace is usually some unmap or task exit, and sometimes it only
>>>>>>>> shows when you try to shut down the VM after the tests.
>>>>>>>
>>>>>>> Funky. I just ran a bunch of loops of the liburing tests and Ming's
>>>>>>> ublksrv test case as well on io_uring-6.9 and it all worked fine. Trying
>>>>>>> the liburing tests on for-6.10/io_uring as well now, but I didn't see
>>>>>>> anything the other times I ran it. In any case, once you repost I'll
>>>>>>> rebase and then let's see if it hits again.
>>>>>>>
>>>>>>> Did you run with KASAN enabled?
>>>>>>
>>>>>> Yes, it's a debug kernel, full on KASAN, lockdep and so on.
>>>>>
>>>>> And another note: I triggered it once (IIRC on shutdown) with the ublk
>>>>> tests only, w/o liburing/tests, which likely limits it to either the
>>>>> core io_uring infra or non-io_uring bugs.
>>>>
>>>> Been running on for-6.10/io_uring, and the only odd thing I see is that
>>>> the test output tends to stall here:
>>>>
>>>> Running test read-before-exit.t
>>>>
>>>> which then either leads to a connection disconnect from my ssh into that
>>>> vm, or just a long delay before it picks up again. This did not happen
>>>> with io_uring-6.9.
>>>>
>>>> Maybe related? At least it's something new. Just checked again, and yeah,
>>>> it seems to totally lock up the vm while that is running.
>>>> Will try a quick bisect of that series.
>>>
>>> Seems to be triggered by the top-of-branch patch in there, my poll and
>>> timeout special casing. While the above test case runs with that commit,
>>> it'll freeze the host.
>>
>> Had a feeling this was the busy looping off cancelations, and flushing
>> the fallback task_work seems to fix it. I'll check more tomorrow.
>>
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index a2cb8da3cc33..f1d3c5e065e9 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
>>  	if (task)
>>  		ret |= io_run_task_work() > 0;
>> +	else if (ret)
>> +		flush_delayed_work(&ctx->fallback_work);
>>  	return ret;
>>  }
>
> Still can trigger the warning with the above patch:
>
> [ 446.275975] ------------[ cut here ]------------
> [ 446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180

And this is running that test case you referenced? I'll take a look, as
it seems related to the poll kill rather than the other patchset.

-- 
Jens Axboe