On Sun, 2 Apr 2023, Michael Schmitz wrote:
Saved registers are restored from the stack before return from __GI___wait4_time64 but we don't know which of the two wait4 call sites was used, do we? What registers does __m68k_read_tp@plt clobber?
But that won't matter to the caller, __wait3, right? I did check that %a3 was saved on entry, before any wait4 syscall or __m68k_read_tp call etc. I also looked at the rts and %a3 did get restored there. Is it worth the effort to trace every branch, in case there's some way to reach an rts without having first restored the saved registers?
Maybe an interaction between (multiple?) signals and syscall return...
When running dash from gdb in QEMU, there's only one signal (SIGCHLD) and it gets handled before __wait3() returns. (Of course, the "stack smashing detected" failure never shows up in QEMU.)
depends on how long we sleep in wait4, and whether a signal happens just during that time.
I agree, there seems to be a race condition there. (And dash's waitproc() seems to take pains to reap the child and handle the signal in any order.) I wouldn't be surprised if this race somehow makes the failure rare. I don't want to recompile any userland binaries at this stage, so it would be nice if we could modify the kernel to keep track of exactly how that race gets won and lost. Or perhaps there's an easy way to rig the outcome one way or the other.
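For experimenting outside of dash, the ordering can be rigged from userland by blocking SIGCHLD around the wait3() call. The following is only a rough sketch, assuming a standalone test program rather than anything in dash or the kernel. The names on_sigchld and handler_ran_before_wait3_returned are made up for illustration. With SIGCHLD blocked, the handler cannot run until after wait3() has returned; with it unblocked, the handler is free to run earlier.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler so the caller can tell whether it has run yet. */
static volatile sig_atomic_t sigchld_seen;

static void on_sigchld(int sig)
{
    (void)sig;
    sigchld_seen = 1;
}

/*
 * Hypothetical helper (not dash or glibc code): fork a child that exits
 * at once, reap it with wait3(&status, 0, NULL), and report whether the
 * SIGCHLD handler had already run when wait3() returned.
 * Returns 1 (handler ran first), 0 (it had not run yet) or -1 on error.
 * With block_sigchld == 1 the signal stays blocked across wait3().
 */
static int handler_ran_before_wait3_returned(int block_sigchld)
{
    struct sigaction sa;
    sigset_t chld, old;
    int status = 0, seen_at_return;
    pid_t child, reaped;

    assert(block_sigchld == 0 || block_sigchld == 1);

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    /* Block SIGCHLD before fork() so the flag reset below cannot race. */
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &chld, &old) < 0)
        return -1;
    sigchld_seen = 0;

    child = fork();
    if (child < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (child == 0)
        _exit(0);

    if (!block_sigchld)
        sigprocmask(SIG_UNBLOCK, &chld, NULL);

    /* Retry if the handler interrupts the wait before the child is reaped. */
    do {
        reaped = wait3(&status, 0, NULL);
    } while (reaped < 0 && errno == EINTR);

    seen_at_return = sigchld_seen;
    sigprocmask(SIG_SETMASK, &old, NULL);

    return reaped == child ? seen_at_return : -1;
}
```

Calling the helper in a loop with each setting, on the affected hardware, might show whether one ordering correlates with the failure. That is speculation on my part, not a result.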
%a3 is the first register saved to the switch stack BTW. That kernel does contain Al Viro's patch that corrected our switch stack handling in the signal return path? I wonder whether there's a potential race lurking in there?
I'm not sure which patch you're referring to, but I think Al's signal handling work appeared in v5.15-rc4. I have reproduced the "stack smashing detected" failure with v5.14.0 and with recent mainline (62bad54b26db from March 30th).
And I just noticed that we had trouble with a copy_to_user in setup_frame() earlier (the reason for my buserr handler patch). I wonder whether something's gone wrong there. Do you get a segfault instead of the abort signal if you drop my patch?
Are you referring to e36a82bebbf7? I doubt that it's related. I believe that copy_to_user is not involved here, for the reason already given: wait3(status, flags, NULL) means wait4 gets a NULL pointer for the struct rusage * parameter. Also, Stan first reported this failure in December with v6.0.9.
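To spell out the NULL argument: as far as I know, glibc's wait3() just forwards its arguments to wait4() with a pid of -1, so no struct rusage buffer is involved in this call. Here is a minimal sketch of that equivalence. The names reap_any_child and spawn_and_reap are made up for illustration and do not come from glibc or dash.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: reap any child the way wait3(status, flags, NULL)
 * is understood to, i.e. wait4() with pid -1 and a NULL rusage pointer.
 */
static pid_t reap_any_child(int *status, int flags)
{
    return wait4(-1, status, flags, NULL);
}

/*
 * Hypothetical demo: fork a child that exits with `code`, reap it and
 * return the exit status seen by the parent, or -1 on error.
 */
static int spawn_and_reap(int code)
{
    int status = 0;
    pid_t child, reaped;

    assert(code >= 0 && code <= 255);

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(code);

    reaped = reap_any_child(&status, 0);
    if (reaped != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Since the rusage pointer is NULL, only the status word needs to be written back to userspace in this sketch.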