On Sun, Mar 17, 2024 at 07:34:30PM -0600, Jens Axboe wrote:
> On 3/17/24 6:15 PM, Ming Lei wrote:
> > On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
> >> On 3/17/24 4:07 PM, Jens Axboe wrote:
> >>> On 3/17/24 3:51 PM, Jens Axboe wrote:
> >>>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
> >>>>> On 3/17/24 21:34, Pavel Begunkov wrote:
> >>>>>> On 3/17/24 21:32, Jens Axboe wrote:
> >>>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
> >>>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
> >>>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
> >>>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
> >>>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
> >>>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
> >>>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>>>> The following two errors can be triggered with this patchset
> >>>>>>>>>>>>> when running some ublk stress tests (io vs. deletion), and such
> >>>>>>>>>>>>> failures are not seen after reverting the 11 patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
> >>>>>>>>>>>> reproduce it, blktests?
> >>>>>>>>>>>
> >>>>>>>>>>> Yeah, it needs yesterday's fix.
> >>>>>>>>>>>
> >>>>>>>>>>> You may need to run this test multiple times to trigger the problem:
> >>>>>>>>>>
> >>>>>>>>>> Thanks for all the testing. I've tried it; all ublk/generic tests hang
> >>>>>>>>>> in userspace waiting for CQEs, but there are no complaints from the
> >>>>>>>>>> kernel. However, it seems the branch is buggy even without my patches:
> >>>>>>>>>> I consistently (5-15 minutes of running in a slow VM) hit a page
> >>>>>>>>>> underflow by running the liburing tests. Not sure what that is yet,
> >>>>>>>>>> but it might also be the reason.
> >>>>>>>>>
> >>>>>>>>> Hmm, odd, there's nothing in there but your series and then the
> >>>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
> >>>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
> >>>>>>>>> haven't seen anything odd.
> >>>>>>>>
> >>>>>>>> I need to test io_uring-6.9. I actually checked the branch twice, both
> >>>>>>>> times with the issue, and from the full recompilation and config
> >>>>>>>> prompts I assumed you had pulled something in between (maybe not).
> >>>>>>>>
> >>>>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the
> >>>>>>>> stack trace is usually some unmap or task exit, and sometimes it only
> >>>>>>>> shows when you try to shut down the VM after the tests.
> >>>>>>>
> >>>>>>> Funky. I just ran a bunch of loops of the liburing tests and Ming's
> >>>>>>> ublksrv test case as well on io_uring-6.9, and it all worked fine.
> >>>>>>> Trying the liburing tests on for-6.10/io_uring as well now, but I
> >>>>>>> didn't see anything the other times I ran it. In any case, once you
> >>>>>>> repost I'll rebase, and then let's see if it hits again.
> >>>>>>>
> >>>>>>> Did you run with KASAN enabled?
> >>>>>>
> >>>>>> Yes, it's a debug kernel, full-on KASAN, lockdep and so on.
> >>>>>
> >>>>> And another note: I triggered it once (IIRC on shutdown) with the ublk
> >>>>> tests only, w/o liburing/tests, which likely limits it to either the
> >>>>> core io_uring infra or non-io_uring bugs.
> >>>>
> >>>> I've been running on for-6.10/io_uring, and the only odd thing I see is
> >>>> that the test output tends to stall here:
> >>>>
> >>>> Running test read-before-exit.t
> >>>>
> >>>> which then either leads to my ssh connection into that VM dropping, or
> >>>> just a long delay before it picks up again. This did not happen with
> >>>> io_uring-6.9.
> >>>>
> >>>> Maybe related? At least it's something new. Just checked again, and
> >>>> yeah, it seems to totally lock up the VM while that test is running.
> >>>> Will try a quick bisect of that series.
> >>>
> >>> Seems to be triggered by the top-of-branch patch in there, my poll and
> >>> timeout special casing. While the above test case is running with that
> >>> commit applied, it'll freeze the host.
> >>
> >> Had a feeling this was the busy looping off cancellations, and flushing
> >> the fallback task_work seems to fix it. I'll check more tomorrow.
> >>
> >>
> >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> >> index a2cb8da3cc33..f1d3c5e065e9 100644
> >> --- a/io_uring/io_uring.c
> >> +++ b/io_uring/io_uring.c
> >> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
> >>  	if (task)
> >>  		ret |= io_run_task_work() > 0;
> >> +	else if (ret)
> >> +		flush_delayed_work(&ctx->fallback_work);
> >>  	return ret;
> >>  }
> >
> > I can still trigger the warning with the above patch:
> >
> > [ 446.275975] ------------[ cut here ]------------
> > [ 446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
>
> And this is running that test case you referenced? I'll take a look, as
> it seems related to the poll kill rather than the other patchset.

Yeah, and now I am running 'git bisect' on Pavel's V2.

thanks,
Ming
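As a sketch of the busy-loop-vs-flush pattern discussed above: the following
is a minimal, userspace-only analogue in C (pthreads), not io_uring code, and
the names fallback_worker, try_cancel and flush_fallback are invented for the
example. A cancel pass keeps reporting outstanding work that only a deferred
worker can retire, so the canceller has to actually flush that worker (as the
hunk does with flush_delayed_work()) instead of just re-checking and spinning.

/* Illustrative only: NOT io_uring code; all names invented for the example. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t kick = PTHREAD_COND_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
static int pending = 1;        /* "requests" punted to the fallback path */
static bool worker_kicked;
static bool worker_finished;

/* Deferred "fallback" worker: retires the punted work once kicked. */
static void *fallback_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!worker_kicked)
		pthread_cond_wait(&kick, &lock);
	pending = 0;           /* complete the punted requests */
	worker_finished = true;
	pthread_cond_signal(&done);
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Cancel pass: only reports whether anything is still outstanding. */
static bool try_cancel(void)
{
	pthread_mutex_lock(&lock);
	bool busy = pending > 0;
	pthread_mutex_unlock(&lock);
	return busy;
}

/* Rough analogue of flush_delayed_work(): kick the worker and wait for it. */
static void flush_fallback(void)
{
	pthread_mutex_lock(&lock);
	worker_kicked = true;
	pthread_cond_signal(&kick);
	while (!worker_finished)
		pthread_cond_wait(&done, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	pthread_t thr;

	pthread_create(&thr, NULL, fallback_worker, NULL);

	/*
	 * Without the flush this loop never terminates: try_cancel() keeps
	 * seeing work that only fallback_worker() can finish.
	 */
	while (try_cancel())
		flush_fallback();

	pthread_join(thr, NULL);
	printf("cancellation drained\n");
	return 0;
}

Build with cc -pthread; remove the flush_fallback() call from the loop and
main() spins forever, which is the livelock shape described in the thread.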