On 25/07/2023 18:38, Linus Torvalds wrote:
> But before we revert it, would you mind trying out the attached
> trivial patch instead?

Not Fiona, but as I was still online yesterday I already got around to
trying that patch out, after adding the missing `tsk` task_struct param
to the fatal_signal_pending call.

With the patched kernel booted, the original case we found in the wild
went from logging a segfault roughly twice per hour to none at all, and
that over a bit more than 10h of uptime.

Fiona might have a more definitive confirmation, as IIRC she has a
better (= faster) reproducer that she used for bisecting.

> I'd also still be interested if the symptoms were anything else than
> 'show_unhandled_signals' causing the show_signal_msg() dance, and
> resulting in a message something like
>
>   a.out[1567]: segfault at xyz ip [..] likely on CPU X
>
> in dmesg...

Exactly, it was just like that, with no actual fallout. The messages
were like:

> pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)

And the slightly odd code triggering this was basically a fork, where
the child wrote a message to the parent via a unix socket pair and then
called exit. The parent read that message and then sent a SIGKILL to the
child process, i.e., the child's exit and the parent killing the child
would be pretty closely aligned, basically racing with each other.

cheers,
Thomas
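
For illustration only, the fork/exit/SIGKILL pattern described above can
be sketched roughly as below. This is not the original code (that ran
under perl, per the log line); the function name, the message payload
and the iteration count are made up for the sketch, and whether it
actually triggers the spurious segfault message on an affected kernel
has not been verified.

```python
import os
import signal
import socket


def race_exit_against_sigkill() -> int:
    """Fork once and return the raw wait status of the child.

    Hypothetical helper: the child reports to the parent over a unix
    socket pair and exits; the parent sends SIGKILL as soon as it has
    read the message, so exit and kill race with each other.
    """
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: send the message, then exit immediately.
        parent_sock.close()
        child_sock.sendall(b"done\n")
        os._exit(0)

    # Parent: wait for the child's message ...
    child_sock.close()
    parent_sock.recv(16)
    # ... then kill the child, which is most likely exiting right now.
    # The child has not been reaped yet, so the pid is still valid.
    os.kill(pid, signal.SIGKILL)
    parent_sock.close()
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__":
    # Arbitrary iteration count; count which side won the race each time.
    exited = killed = 0
    for _ in range(100):
        status = race_exit_against_sigkill()
        if os.WIFSIGNALED(status):
            killed += 1
        else:
            exited += 1
    print(f"exited normally: {exited}, killed by signal: {killed}")
```

Either outcome per iteration is legitimate; the point is only that the
kill lands while the child is in the middle of exiting. On an affected
kernel one would then watch dmesg for the segfault line while this runs.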