On 6/12/23 10:27 AM, Jens Axboe wrote:
> On 6/12/23 9:56 AM, Linus Torvalds wrote:
>> On Mon, Jun 12, 2023 at 8:36 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>>
>>>> Or maybe Darrick (who doesn't see the issue) is running on raw
>>>> hardware, and you and Zorro are running in a virtual environment?
>>>
>>> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
>>> so fstests didn't get built with io_uring support. That probably
>>> explains why I don't see any of these hangs.
>>>
>>> Oh. I can't *install* the debian liburing-dev package because it has
>>> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
>>> with me having a linux-libc-dev-djwong package that contains the uapi
>>> headers for the latest upstream kernel and Replaces: linux-libc-dev.
>>> So either I have to create a dummy linux-libc-dev with an adequate
>>> version number that pulls in my own libc header package, or rename
>>> that package.
>>>
>>> <sigh> It's going to take me a while to research how best to untangle
>>> this stupid knot.
>>
>> Oh, no, that's great. It explains why you don't see the problem, and
>> Dave and Zorro do. Perfect.
>>
>> No need for you to install any liburing packages, at least for this
>> issue. You'll probably want it eventually just for test coverage, but
>> for now it's the smoking gun we wanted - I was looking at why vhost
>> would be impacted, because that commit very intentionally *tried* to
>> not do anything at all to io_uring.
>>
>> But it obviously failed. Which then in turn explains the bug.
>>
>> Not that I see exactly where it went wrong yet, but at least we're
>> looking at the right thing. Adding Jens to the participants, in case
>> he sees what goes wrong.
>>
>> Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
>> freezer/ps regression") seems to have broken core dumping with
>> io_uring threads, even though it tried very hard not to. See
>>
>> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>>
>> for the beginning of this thread.
>>
>> Honestly, that "try to not change io_uring" was my least favorite part
>> of that patch, because I really think we want to try to aim for these
>> user helper threads having as much infrastructure in common as
>> possible. And when it comes to core dumps, I do not believe that
>> waiting for the io_uring thread adds anything to the end result,
>> because the only reason we wait for it is to put the thread register
>> state into the core dump, and for kernel helper threads, that
>> information just isn't useful. It's going to be the state that caused
>> the thread to be created, not anything that is worth saving in a core
>> dump.
>>
>> So I'd actually prefer to just simplify the logic entirely, and say
>> "PF_USER_WORKER tasks do not participate in core dumps, end of story".
>> io_uring didn't _care_, so including them wasn't a pain, but if the
>> vhost exit case can be delayed, I'd rather just say "let's do this
>> for both io_uring and vhost, and not split those two cases up".
>>
>> Anyway, I don't see exactly what goes wrong, but I feel better just
>> from this having been narrowed down to io_uring threads. I suspect
>> Jens actually might even have a core-dumping test case somewhere,
>> since core dumping was a thing that io_uring ended up having some
>> issues with at one point.
>
> I'll take a look - at the exact same time as your email, someone just
> reported this issue separately on the liburing GH page as well. Tried
> it myself, and yup, anything that ends up spawning an io-wq worker and
> then core dumps will now get stuck:
>
> [ 136.271250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 136.271711] task:ih state:D stack:0 pid:736 ppid:727 flags:0x00000004
> [ 136.272218] Call trace:
> [ 136.272353]  __switch_to+0xb0/0xc8
> [ 136.272555]  __schedule+0x528/0x584
> [ 136.272757]  schedule+0x4c/0x90
> [ 136.272936]  schedule_timeout+0x30/0xdc
> [ 136.273179]  __wait_for_common+0x8c/0x118
> [ 136.273407]  wait_for_completion_state+0x1c/0x30
> [ 136.273686]  do_coredump+0x334/0x1000
> [ 136.273898]  get_signal+0x19c/0x5d8
> [ 136.274108]  do_notify_resume+0x10c/0xa0c
> [ 136.274346]  el0_da+0x50/0x5c
> [ 136.274555]  el0t_64_sync_handler+0xb8/0x134
> [ 136.274812]  el0t_64_sync+0x168/0x16c
>
> Not good... I don't immediately see what the issue is, but I'll poke
> shortly after a few meetings.
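
FWIW, a minimal reproducer along those lines looks something like the
below - just an untested sketch, where IOSQE_ASYNC is set purely to
force the request to be punted to an io-wq worker (the file being read
is otherwise arbitrary):

#include <liburing.h>
#include <fcntl.h>
#include <signal.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	int fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fd = open("/etc/hostname", O_RDONLY);
	if (fd < 0)
		return 1;

	/* IOSQE_ASYNC forces the read to be punted to an io-wq worker,
	 * which then sticks around idle after the completion. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	sqe->flags |= IOSQE_ASYNC;
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);

	/* Crash while the worker still exists - the coredump path then
	 * hangs waiting for it. */
	raise(SIGSEGV);
	return 0;
}

Build with -luring and run with core dumps enabled; the process should
wedge in do_coredump() much like the trace above.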
Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
exit, and now we fail the check in coredump_task_exit() that was added.
From my quick recollection, this is done to avoid hitting the
schedule-out callback on exit. But I could be totally wrong... In any
case, I'd be surprised if this isn't why it got broken by Mike's patch.

--
Jens Axboe