Re: core dump analysis, was Re: stack smashing detected

Finn Thain <fthain@xxxxxxxxxxxxxx> · Sun, 2 Apr 2023 19:31:08 +1000 (AEST)

On Sun, 2 Apr 2023, Michael Schmitz wrote:

Saved registers are restored from the stack before return from 
__GI___wait4_time64 but we don't know which of the two wait4 call sites 
was used, do we?

What registers does __m68k_read_tp@plt clobber?

But that won't matter to the caller, __wait3, right? 

I did check that %a3 was saved on entry, before any wait4 syscall or 
__m68k_read_tp call etc. I also looked at the rts and %a3 did get restored 
there. Is it worth the effort to trace every branch, in case there's some 
way to reach an rts without having first restored the saved registers?

Maybe an interaction between (multiple?) signals and syscall return...

When running dash from gdb in QEMU, there's only one signal (SIGCHLD) and 
it gets handled before __wait3() returns. (Of course, the "stack smashing 
detected" failure never shows up in QEMU.)

That depends on how long we sleep in wait4, and whether a signal happens 
just during that time.

I agree, there seems to be a race condition there. (And dash's waitproc() 
seems to take pains to reap the child and handle the signal in any order.) 
I wouldn't be surprised if this race somehow makes the failure rare.
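
For anyone who wants to poke at the ordering from userspace without touching dash, here is a minimal standalone sketch. It is not dash's waitproc(); the names fork_and_reap, reap_result and on_sigchld are made up for illustration. It forks a child, blocks in wait3(status, 0, NULL), and records whether the SIGCHLD handler had already run by the time wait3 returned. The handler is installed without SA_RESTART on the assumption that an interrupted wait should be visible as EINTR and retried.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t keeps the store async-signal-safe. */
static volatile sig_atomic_t sigchld_seen;

static void on_sigchld(int signo)
{
    (void)signo;
    sigchld_seen = 1;
}

/* Hypothetical result type, used only by this sketch. */
struct reap_result {
    pid_t child;         /* pid returned by fork() */
    pid_t reaped;        /* pid returned by wait3() */
    int status;          /* wait status of the reaped child */
    int handler_first;   /* 1 if the handler had run when wait3 returned */
};

/* Fork a child that sleeps briefly and exits, then reap it with wait3. */
static struct reap_result fork_and_reap(unsigned int child_sleep_us)
{
    struct reap_result r = { -1, -1, 0, 0 };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;     /* no SA_RESTART: wait3 may fail with EINTR */
    sigchld_seen = 0;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return r;

    r.child = fork();
    if (r.child < 0)
        return r;
    if (r.child == 0) {
        usleep(child_sleep_us);
        _exit(0);
    }

    /* Retry if the signal interrupted the wait before the child was reaped. */
    do {
        r.reaped = wait3(&r.status, 0, NULL);
    } while (r.reaped < 0 && errno == EINTR);

    r.handler_first = sigchld_seen;
    return r;
}
```

Calling fork_and_reap() in a loop with different sleep values and counting handler_first would give a rough idea of how often each ordering occurs, though it says nothing about what the kernel did to the saved registers.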

I don't want to recompile any userland binaries at this stage, so it would 
be nice if we could modify the kernel to keep track of exactly how that 
race gets won and lost. Or perhaps there's an easy way to rig the outcome 
one way or the other.

%a3 is the first register saved to the switch stack BTW.

That kernel does contain Al Viro's patch that corrected our switch stack 
handling in the signal return path? I wonder whether there's a potential 
race lurking in there?

I'm not sure which patch you're referring to, but I think Al's signal 
handling work appeared in v5.15-rc4. I have reproduced the "stack smashing 
detected" failure with v5.14.0 and with recent mainline (62bad54b26db from 
March 30th).

And I just noticed that we had had trouble with a copy_to_user in 
setup_frame() earlier (reason for my buserr handler patch). I wonder 
whether something's gone wrong there. Do you get a segfault instead of 
the abort signal if you drop my patch?

Are you referring to e36a82bebbf7? I doubt that it's related. I believe 
that copy_to_user is not involved here for the reason already given, i.e. 
wait3(status, flags, NULL) means wait4 gets a NULL pointer for the struct 
rusage * parameter. Also, Stan first reported this failure in December 
with v6.0.9.
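
To spell out the NULL rusage point, here is a small sketch of the call shape I mean. It is an illustration under my own assumptions, not a copy of glibc's wait3 source; reap_any_child and run_child_and_reap are hypothetical names. With a NULL struct rusage pointer there is no resource usage buffer for the kernel to copy out to.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: reap any child the way wait3(status, flags, NULL)
 * is described above, i.e. wait4 with pid -1 and a NULL rusage pointer.
 */
static pid_t reap_any_child(int *status, int flags)
{
    return wait4(-1, status, flags, NULL);
}

/*
 * Hypothetical demo: fork a child that exits with exit_code, reap it,
 * and return the exit status seen by the parent (or -1 on failure).
 */
static int run_child_and_reap(int exit_code)
{
    int status = 0;
    pid_t reaped;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(exit_code);

    /* Retry if a signal interrupts the wait. */
    do {
        reaped = reap_any_child(&status, 0);
    } while (reaped < 0 && errno == EINTR);

    if (reaped != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```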