Re: core dump analysis, was Re: stack smashing detected

On Wed, 5 Apr 2023, I wrote:


> I don't care that much what dash does as long as it isn't corrupting
> its own stack, which is a real possibility, and one which gdb's data
> watchpoint would normally resolve. And yet I have no way to tackle
> that.
>
> I've been running gdb under QEMU, where the failure is not reproducible.
> Running dash under gdb on real hardware is doable (RAM permitting). But
> the failure is intermittent even then -- it only happens during
> execution of certain init scripts, and I can't reproduce it by manually
> running those scripts.
>
> (Even if I could reproduce the failure under gdb, instrumenting
> execution in gdb can alter timing in undesirable ways...)


Somewhat optimistically, I upgraded the RAM on this system to 36 MB so 
that I could run dash under gdb (20 MB was not enough). But, as expected, 
the crash went away when I did so.

Outside of gdb, I was able to reproduce the same failure with a clean 
build from the dash repo (commit b00288f). I can still get a crash at 
optimization levels -O1 and -O, though it becomes even rarer, so it's 
easier to work with Debian's build (-O2).

One of the difficulties with the core dump is that it is produced too 
late. After the canary check fails, __stack_chk_fail() is called, which in 
turn calls a chain of other functions until abort() is finally reached. 
This obliterates whatever was below the stack pointer at the time of the 
failure.
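
(For reference, the compiler-generated check amounts to roughly the
following C. This is only a sketch -- the real check is emitted inline in
the function epilogue, and where the canary value comes from varies by
target -- but it shows why everything called once the check fires runs on
the stack below the failed frame's SP.)

/* Rough C equivalent of a -fstack-protector epilogue check (illustrative
 * only; gcc emits this inline, and the guard may live in TLS rather than
 * a global depending on the target). */
#include <stdint.h>

extern uintptr_t __stack_chk_guard;  /* canary value set up by libc */
extern void __stack_chk_fail(void);  /* prints a message, then abort()s */

void protected_function(void)
{
    volatile uintptr_t canary = __stack_chk_guard;  /* saved above the locals */
    char buf[64];

    /* ... function body that might overflow buf ... */
    (void)buf;

    if (canary != __stack_chk_guard)
        __stack_chk_fail();  /* runs on the same stack, so it and everything
                                it calls scribbles over the memory below this
                                frame's stack pointer */
}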

So I modified libc.so.6 so that it now just crashes with an illegal 
instruction in __wait3() rather than branching to __stack_chk_fail(). This 
let me see whatever was left behind in stack memory by __wait4_time64() etc.
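
(For reference, one way to apply that kind of patch: overwrite the first
two bytes of the branch to __stack_chk_fail with 0x4afc, the
architecturally defined ILLEGAL opcode on m68k. The rest of the original
instruction never executes, because the process dies on SIGILL right there
and dumps core with the stack intact. A trivial patcher along these lines
will do, given the file offset of the instruction -- worked out from
objdump -d plus the program headers, or objdump --file-offsets:)

/* Minimal patcher: overwrite two bytes at a given file offset in a copy
 * of libc.so.6 with 0x4afc (the m68k ILLEGAL instruction), e.g. at the
 * start of the branch to __stack_chk_fail found with objdump.
 * The target file and offset are command-line arguments. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    static const unsigned char illegal[2] = { 0x4a, 0xfc };
    FILE *f;
    long off;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <libc-copy> <file-offset>\n", argv[0]);
        return 1;
    }
    off = strtol(argv[2], NULL, 0);
    f = fopen(argv[1], "r+b");
    if (!f || fseek(f, off, SEEK_SET) != 0 ||
        fwrite(illegal, 1, 2, f) != 2) {
        perror(argv[1]);
        return 1;
    }
    fclose(f);
    return 0;
}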

__wait4_time64() calls __m68k_read_tp(), and the return address from 
__m68k_read_tp() can still be seen in stack memory, which suggests that 
the stack never grew after that call. (So __m68k_read_tp() is implicated.)

Would signal delivery overwrite any of the memory immediately below the 
USP? If so, it would have erased those old stack frames, so their survival 
would give some indication of the timing of signal delivery.

If I run dash under gdb under QEMU, I can break on entry to onsig() and 
find the signal frame on the stack. But when I examine stack memory from 
the core dump, I can't find 0x70774e40 (i.e. moveq #__NR_sigreturn,%d0 ; 
trap #0), which the kernel puts on the stack in my QEMU experiments.
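
(For concreteness, that means looking for the byte sequence 70 77 4e 40 --
m68k is big-endian -- anywhere in the dumped stack pages. A trivial
scanner along these lines will do, given a raw dump of the stack region:)

/* Trivial scanner: report every occurrence of the m68k sigreturn
 * trampoline bytes (moveq #__NR_sigreturn,%d0 ; trap #0 == 70 77 4e 40)
 * in a raw memory dump, e.g. the stack region extracted from the core. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    static const unsigned char pat[4] = { 0x70, 0x77, 0x4e, 0x40 };
    unsigned char *buf;
    long n, i;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <raw-dump>\n", argv[0]);
        return 1;
    }
    f = fopen(argv[1], "rb");
    if (!f) { perror(argv[1]); return 1; }
    fseek(f, 0, SEEK_END);
    n = ftell(f);
    rewind(f);
    buf = malloc(n);
    if (!buf || fread(buf, 1, n, f) != (size_t)n) {
        perror("read");
        return 1;
    }
    for (i = 0; i + 4 <= n; i++)
        if (memcmp(buf + i, pat, 4) == 0)
            printf("trampoline bytes at offset 0x%lx\n", i);
    free(buf);
    fclose(f);
    return 0;
}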

That suggests that no signal was delivered... and yet gotsigchld == 1 at 
the time of the core dump, after having been initialized by waitproc() 
prior to calling __wait3(). So the signal handler onsig() must have 
executed during __wait3() or __wait4_time64(). I can't explain this.
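
(To spell out the interplay, in simplified form -- this is a paraphrase,
not the literal dash source:)

/* Simplified paraphrase of the relevant dash logic (not the literal
 * source): waitproc() clears gotsigchld just before the wait syscall,
 * and the only place it is set back to 1 is the SIGCHLD handler. */
#define _DEFAULT_SOURCE 1
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>

volatile sig_atomic_t gotsigchld;

void onsig(int signo)
{
    if (signo == SIGCHLD)
        gotsigchld = 1;
    /* ... other pending-signal bookkeeping ... */
}

int waitproc(int block, int *status)
{
    int err;

    gotsigchld = 0;  /* cleared here, before the syscall */
    do
        err = wait3(status, block ? 0 : WNOHANG, NULL);
    while (err < 0 && errno == EINTR);

    /* so if the core shows gotsigchld == 1, onsig() must have run
     * somewhere between the clear above and the crash inside
     * __wait3()/__wait4_time64() */
    return err;
}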


