On 3/17/24 6:15 PM, Ming Lei wrote:
> On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
>> On 3/17/24 4:07 PM, Jens Axboe wrote:
>>> On 3/17/24 3:51 PM, Jens Axboe wrote:
>>>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
>>>>> On 3/17/24 21:34, Pavel Begunkov wrote:
>>>>>> On 3/17/24 21:32, Jens Axboe wrote:
>>>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
>>>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
>>>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
>>>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
>>>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
>>>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
>>>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>>>> The following two errors can be triggered with this patchset
>>>>>>>>>>>>> when running some ublk stress test (io vs. deletion). I don't
>>>>>>>>>>>>> see such failures after reverting the 11 patches.
>>>>>>>>>>>>
>>>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
>>>>>>>>>>>> reproduce it, blktests?
>>>>>>>>>>>
>>>>>>>>>>> Yeah, it needs yesterday's fix.
>>>>>>>>>>>
>>>>>>>>>>> You may need to run this test multiple times to trigger the problem:
>>>>>>>>>>
>>>>>>>>>> Thanks for all the testing. I've tried it; all ublk/generic tests hang
>>>>>>>>>> in userspace waiting for CQEs, but there are no complaints from the
>>>>>>>>>> kernel. However, it seems the branch is buggy even without my patches:
>>>>>>>>>> I consistently (5-15 minutes of running in a slow VM) hit a page
>>>>>>>>>> underflow by running the liburing tests. Not sure what that is yet,
>>>>>>>>>> but it might also be the reason.
>>>>>>>>>
>>>>>>>>> Hmm, odd, there's nothing in there but your series and then the
>>>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
>>>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
>>>>>>>>> haven't seen anything odd.
>>>>>>>>
>>>>>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
>>>>>>>> with the issue, and from the full recompilation and config prompts I
>>>>>>>> assumed you had pulled something in between (maybe not).
>>>>>>>>
>>>>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the
>>>>>>>> stack trace is usually some unmap or task exit, and sometimes it only
>>>>>>>> shows when you try to shut down the VM after the tests.
>>>>>>>
>>>>>>> Funky. I just ran a bunch of loops of the liburing tests and Ming's
>>>>>>> ublksrv test case as well on io_uring-6.9 and it all worked fine. Trying
>>>>>>> the liburing tests on for-6.10/io_uring as well now, but I didn't see
>>>>>>> anything the other times I ran it. In any case, once you repost I'll
>>>>>>> rebase and then let's see if it hits again.
>>>>>>>
>>>>>>> Did you run with KASAN enabled?
>>>>>>
>>>>>> Yes, it's a debug kernel, full on KASAN, lockdep and so on.
>>>>>
>>>>> And another note: I triggered it once (IIRC on shutdown) with the ublk
>>>>> tests only, w/o liburing/tests, which likely limits it to either the
>>>>> core io_uring infra or non-io_uring bugs.
>>>>
>>>> Been running on for-6.10/io_uring, and the only odd thing I see is that
>>>> the test output tends to stall here:
>>>>
>>>> Running test read-before-exit.t
>>>>
>>>> which then either leads to a connection disconnect from my ssh into that
>>>> vm, or just a long delay before it picks up again. This did not happen
>>>> with io_uring-6.9.
>>>>
>>>> Maybe related? At least it's something new. Just checked again, and yeah,
>>>> it seems to totally lock up the vm while that is running.
>>>> Will try a quick bisect of that series.
>>>
>>> Seems to be triggered by the top-of-branch patch in there, my poll and
>>> timeout special casing. While the above test case runs with that commit,
>>> it'll freeze the host.
>>
>> Had a feeling this was the busy looping off cancelations, and flushing
>> the fallback task_work seems to fix it. I'll check more tomorrow.
>>
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index a2cb8da3cc33..f1d3c5e065e9 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
>>  	if (task)
>>  		ret |= io_run_task_work() > 0;
>> +	else if (ret)
>> +		flush_delayed_work(&ctx->fallback_work);
>>  	return ret;
>>  }
>
> Still can trigger the warning with the above patch:
>
> [ 446.275975] ------------[ cut here ]------------
> [ 446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180

And this is running that test case you referenced? I'll take a look, as
it seems related to the poll kill rather than the other patchset.

-- 
Jens Axboe