Re: segfaults of processes while being killed after commit "mm: make the page fault mmap locking killable"

Thomas Lamprecht <t.lamprecht@xxxxxxxxxxx> · Wed, 26 Jul 2023 08:51:24 +0200

On 25/07/2023 18:38, Linus Torvalds wrote:
> But before we revert it, would you mind trying out the attached
> trivial patch instead?

Not Fiona, but as I was still online yesterday I already got around to
trying that patch out, after adding the missing `tsk` task_struct
parameter to the fatal_signal_pending() call.
With the patched kernel booted, the original case we found in the wild
went from logging a segfault roughly twice per hour to none at all,
and that over a bit more than 10h of uptime.
Fiona might have a more definitive confirmation, as IIRC she has a
better (= faster) reproducer that she used for bisecting.

> 
> I'd also still be interested if the symptoms were anything else than
> 'show_unhandled_signals' causing the show_signal_msg() dance, and
> resulting in a message something like
> 
>     a.out[1567]: segfault at xyz ip [..] likely on CPU X
> 
> in dmesg...

Exactly, it was just like that, with no actual fallout. The messages
looked like:

> pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)
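
For reading such a line, here is a minimal Python sketch. It assumes
the x86 message format, where the `error` field is the page-fault
error code printed in hex. The names `decode_segfault`, `LINE_RE` and
`PF_BITS` are made up for illustration.

```python
import re

# x86 page-fault error code bits (bit number -> meaning when set).
# Bit 0 clear means the page was not present.
PF_BITS = {
    0: "protection violation (page present)",
    1: "write access",
    2: "user mode",
    3: "reserved bit set",
    4: "instruction fetch",
}

# Hypothetical pattern for the "segfault at" line as quoted above.
LINE_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Split a 'segfault at' dmesg line into fields and decode the error bits."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints this field in hex
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "flags": [name for bit, name in PF_BITS.items() if err & (1 << bit)],
    }
```

Read that way, `error 14` is 0x14: a user-mode instruction fetch from a
page that was not present, which fits the fault address being equal to
the ip.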

And the slightly odd code triggering this was basically a fork, where
the child wrote a message to the parent via a unix socket pair and
then called exit. The parent read that message and then sent a SIGKILL
to the child process, i.e., the child's exit and the parent's kill of
the child would be pretty closely aligned, basically racing with each
other.
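
A rough sketch of that pattern in Python, for illustration only: the
real code is Perl and is not shown here, the function name `race_once`
is made up, and whether this reproduces the segfault message on an
affected kernel is untested.

```python
import os
import signal
import socket


def race_once():
    """Fork a child that reports in and exits while the parent SIGKILLs it."""
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child: send one message to the parent, then exit right away.
        parent_sock.close()
        child_sock.sendall(b"done")
        os._exit(0)
    # Parent: wait for the message, then kill the child, which is most
    # likely already on its way out. An unreaped child still has a pid,
    # so os.kill() does not fail even if the child has exited already.
    child_sock.close()
    msg = parent_sock.recv(16)
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    parent_sock.close()
    return msg, status
```

Calling `race_once()` in a loop and watching dmesg would be the obvious
way to try it. The wait status is either a normal exit or death by
SIGKILL, depending on who wins the race.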

cheers,
 Thomas