On Wed, 5 Apr 2023, I wrote:
I don't care that much what dash does as long as it isn't corrupting its own stack, which is a real possibility, and one which gdb's data watchpoint would normally resolve. And yet I have no way to tackle that. I've been running gdb under QEMU, where the failure is not reproducible. Running dash under gdb on real hardware is doable (RAM permitting). But the failure is intermittent even then -- it only happens during execution of certain init scripts, and I can't reproduce it by manually running those scripts. (Even if I could reproduce the failure under gdb, instrumenting execution in gdb can alter timing in undesirable ways...)
Somewhat optimistically, I upgraded the RAM on this system to 36 MB so I can run dash under gdb (20 MB was not enough). But, as expected, the crash went away when I did so.

Outside of gdb, I was able to reproduce the same failure with a clean build from the dash repo (commit b00288f). I can get a crash with optimization level -O1 and -O though it becomes even more rare. So it's easier to use Debian's build (-O2).

One of the difficulties with the core dump is that it happens too late. After the canary check fails, __stack_chk_fail() is called, which then calls a bunch of other stuff until finally abort() is called. This obliterates whatever was below the stack pointer at the time of the failure. So I modified libc.so.6 and now it just crashes with an illegal instruction in __wait3 rather than branching to __stack_chk_fail. This let me see whatever was left behind in stack memory by __wait4_time64() etc.

__wait4_time64() calls __m68k_read_tp(), and the return address from __m68k_read_tp() can still be seen in stack memory, which suggests that the stack never grew after that call. (So __m68k_read_tp() is implicated.)

Would signal delivery erase any of the memory immediately below the USP? If so, it would erase those old stack frames, which would give some indication of the timing of signal delivery.

If I run dash under gdb under QEMU, I can break on entry to onsig() and find the signal frame on the stack. But when I examine stack memory from the core dump, I can't find 0x70774e40 (i.e. moveq __NR_sigreturn,%d0 ; trap #0) which the kernel puts on the stack in my QEMU experiments. That suggests that no signal was delivered... and yet gotsigchld == 1 at the time of the coredump, after having been initialized by waitproc() prior to calling __wait3(). So the signal handler onsig() must have executed during __wait3() or __wait4_time64(). I can't explain this.
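
For anyone who wants to repeat the search for the sigreturn trampoline, here is a rough sketch of how it could be scripted. It assumes the stack region has first been written to a raw file (for example with gdb's "dump binary memory"). The function name find_trampolines and the base argument are made up for illustration, and this is not necessarily how the search above was done.

```python
import struct

# moveq #__NR_sigreturn,%d0 ; trap #0, i.e. 0x70774e40.
# m68k is big-endian, so the two opcode words are packed with ">".
TRAMPOLINE = struct.pack(">HH", 0x7077, 0x4E40)


def find_trampolines(data, base=0):
    """Return the addresses (base + offset) of every word-aligned match.

    'data' is the raw stack memory as bytes; 'base' is the (hypothetical)
    address at which the first byte of 'data' was mapped.
    """
    hits = []
    pos = data.find(TRAMPOLINE)
    while pos != -1:
        # m68k instructions are 16-bit aligned; skip odd addresses.
        if (base + pos) % 2 == 0:
            hits.append(base + pos)
        pos = data.find(TRAMPOLINE, pos + 1)
    return hits
```

Calling find_trampolines(open(path, "rb").read(), base) on the dumped region returns an empty list when no trampoline is present, which would correspond to the result described above for the core dump.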