On 6/12/23 10:27 AM, Jens Axboe wrote:
> On 6/12/23 9:56 AM, Linus Torvalds wrote:
>> On Mon, Jun 12, 2023 at 8:36 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>>>
>>>> Or maybe Darrick (who doesn't see the issue) is running on raw
>>>> hardware, and you and Zorro are running in a virtual environment?
>>>
>>> Ahah, it turns out that liburing-dev isn't installed on the test fleet,
>>> so fstests didn't get built with io_uring support. That probably
>>> explains why I don't see any of these hangs.
>>>
>>> Oh. I can't *install* the debian liburing-dev package because it has
>>> a versioned dependency on linux-libc-dev >= 5.1, which isn't compatible
>>> with me having a linux-libc-dev-djwong package that contains the uapi
>>> headers for the latest upstream kernel and Replaces: linux-libc-dev.
>>> So either I have to create a dummy linux-libc-dev with an adequate
>>> version number that pulls in my own libc header package, or rename
>>> that package.
>>>
>>> <sigh> It's going to take me a while to research how best to untangle
>>> this stupid knot.
>>
>> Oh, no, that's great. It explains why you don't see the problem, and
>> Dave and Zorro do. Perfect.
>>
>> No need for you to install any liburing packages, at least for this
>> issue. You'll probably want it eventually just for test coverage, but
>> for now it's the smoking gun we wanted - I was looking at why vhost
>> would be impacted, because that commit very intentionally *tried* to
>> not do anything at all to io_uring.
>>
>> But it obviously failed. Which then in turn explains the bug.
>>
>> Not that I see exactly where it went wrong yet, but at least we're
>> looking at the right thing. Adding Jens to the participants, in case
>> he sees what goes wrong.
>>
>> Jens, commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix
>> freezer/ps regression") seems to have broken core dumping with
>> io_uring threads, even though it tried very hard not to. See
>>
>> https://lore.kernel.org/all/20230611124836.whfktwaumnefm5z5@zlang-mailbox/
>>
>> for the beginning of this thread.
>>
>> Honestly, that "try to not change io_uring" was my least favorite part
>> of that patch, because I really think we want to try to aim for these
>> user helper threads having as much infrastructure in common as
>> possible. And when it comes to core dumps, I do not believe that
>> waiting for the io_uring thread adds anything to the end result,
>> because the only reason we wait for it is to put the thread register
>> state into the core dump, and for kernel helper threads, that
>> information just isn't useful. It's going to be the state that caused
>> the thread to be created, not anything that is worth saving in a core
>> dump.
>>
>> So I'd actually prefer to just simplify the logic entirely, and say
>> "PF_USER_WORKER tasks do not participate in core dumps, end of story".
>> io_uring didn't _care_, so including them wasn't a pain, but if the
>> vhost exit case can be delayed, I'd rather just say "let's do this
>> for both io_uring and vhost, and not split those two cases up".
>>
>> Anyway, I don't see exactly what goes wrong, but I feel better just
>> from this having been narrowed down to io_uring threads. I suspect
>> Jens actually might even have a core-dumping test case somewhere,
>> since core dumping was a thing that io_uring ended up having some
>> issues with at one point.
>
> I'll take a look - at the exact same time as your email, someone just
> reported this issue separately on the liburing GH page as well. Tried
> it myself, and yup, anything that ends up spawning an io-wq worker and
> then core dumps will now get stuck:
>
> [ 136.271250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 136.271711] task:ih state:D stack:0 pid:736 ppid:727 flags:0x00000004
> [ 136.272218] Call trace:
> [ 136.272353]  __switch_to+0xb0/0xc8
> [ 136.272555]  __schedule+0x528/0x584
> [ 136.272757]  schedule+0x4c/0x90
> [ 136.272936]  schedule_timeout+0x30/0xdc
> [ 136.273179]  __wait_for_common+0x8c/0x118
> [ 136.273407]  wait_for_completion_state+0x1c/0x30
> [ 136.273686]  do_coredump+0x334/0x1000
> [ 136.273898]  get_signal+0x19c/0x5d8
> [ 136.274108]  do_notify_resume+0x10c/0xa0c
> [ 136.274346]  el0_da+0x50/0x5c
> [ 136.274555]  el0t_64_sync_handler+0xb8/0x134
> [ 136.274812]  el0t_64_sync+0x168/0x16c
>
> Not good... I don't immediately see what the issue is, but I'll poke
> shortly after a few meetings.
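
FWIW, a minimal reproducer along those lines looks something like the
below - just an untested sketch, where IOSQE_ASYNC is set purely to
force the request to be punted to an io-wq worker (the file being read
is otherwise arbitrary):

#include <liburing.h>
#include <fcntl.h>
#include <signal.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	int fd;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	fd = open("/etc/hostname", O_RDONLY);
	if (fd < 0)
		return 1;

	/* IOSQE_ASYNC forces the read to be punted to an io-wq worker,
	 * which then sticks around idle after the completion. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
	sqe->flags |= IOSQE_ASYNC;
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);

	/* Crash while the worker still exists - the coredump path then
	 * hangs waiting for it. */
	raise(SIGSEGV);
	return 0;
}

Build with -luring and run with core dumps enabled; the process should
wedge in do_coredump() much like the trace above.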
Quick peek would suggest that it's because io-wq clears PF_IO_WORKER on
exit, and now we fail the check in coredump_task_exit() that was added.
From my quick recollection, this is done to avoid hitting the
schedule-out callback on exit. But I could be totally wrong... In any
case, I'd be surprised if this isn't why it got broken by Mike's patch.

--
Jens Axboe