Re: stack smashing detected

Michael Schmitz <schmitzmic@xxxxxxxxx> · Thu, 2 Feb 2023 07:51:30 +1300

Hi Stan,

On 2/02/23 05:38, Stan Johnson wrote:
On 1/30/23 8:05 PM, Michael Schmitz wrote:
...
Am 30.01.2023 um 17:00 schrieb Stan Johnson:
Hello,

I am seeing anywhere from zero to four of the following errors while
booting Linux on 68030 systems and using sysvinit startup scripts:

*** stack smashing detected ***: terminated
Aborted

I usually (but not always) see three of the errors while init is running
the rcS.d scripts, and one while running the rc2.d scripts. The stack
smashing messages appear only on the system console (nothing is logged
in an error log or dmesg). Despite the errors, the system continues
booting to multiuser mode without any obvious additional problems. I
haven't tested systemd, which is too slow to be useful on my m68k
systems (though I have a Debian SID with systemd that I can restore for
testing if necessary).

...
Another way may be logging the start of each of the rcS.d or rc2.d
scripts until you know what scripts to look at in more detail, then
adding 'set -v' at the start of those to log every command in the
offending script.
Hi Michael,

Thanks for your reply.

After logging the start and end of each script, I see that the "stack
smashing detected" error often happens while running
"/etc/rcS.d/S01mountkernfs.sh" (/etc/init.d/mountkernfs.sh). I'll try to
isolate it to a particular command.

This may be a coincidence, but the error seems to happen more often (up
to about 4 times) after a warm boot into Mac OS 7.5.5 and a subsequent
boot into Linux than when starting with a cold boot into Mac OS 7.5.5,
but it doesn't seem that that should make any difference for Linux. This
morning, after a cold boot, I saw two of the errors, while after a warm
boot, I saw four.
Hmm - that might well indicate a hardware issue rather than software. 
Bits flipping at random in RAM (and getting picked up because the stack 
canary changes).

Once the offending binary is known (and the crash can be reproduced
after system boot), gdb can be used to find the function that overwrote
its local stack guard.
Is there a way to configure the kernel to use the stack guard for every
function, and then log every resulting abort? I realize that that would
be very slow, but it might also be useful for debugging.

The stack canary mechanism pushes a token on the stack at function 
entry, and compares against that token's value at function exit. This is 
all code generated by gcc in the user binary.
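
To make the entry/exit check above concrete, here is a minimal user-space
sketch. It only models the idea: the struct layout, the CANARY_VALUE
constant and the guarded_copy() helper are all made up for illustration,
and it is not the code gcc actually emits (the real guard value is
per-process and the real failure path goes through libc, which prints
the "*** stack smashing detected ***" message and aborts).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed token; a real toolchain uses a per-process value. */
#define CANARY_VALUE 0xC0FFEE42UL

/* Models a stack frame: a local buffer followed by the guard word. */
struct frame {
    char buf[8];
    unsigned long canary;
};

/*
 * Copies n bytes from src to the start of the modelled frame and reports
 * whether the guard word survived: 0 if intact, 1 if it was overwritten.
 */
int guarded_copy(const char *src, size_t n)
{
    struct frame f;

    /* Keep the demo itself inside the modelled frame (no real overflow). */
    assert(n <= sizeof f);

    f.canary = CANARY_VALUE;          /* "function entry": store the token */
    memcpy(&f, src, n);               /* copy that may run past f.buf      */
    return f.canary != CANARY_VALUE;  /* "function exit": compare token    */
}
```

A copy of up to 8 bytes leaves the token alone and guarded_copy() returns
0; a copy that covers the whole struct overwrites the token and it
returns 1. For real binaries, gcc's -fstack-protector-all option asks for
this kind of check in every function, rather than only in the functions
gcc selects by default.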

The kernel is not involved in function calls other than syscalls. For 
syscalls, we could try to find the user mode stack, and do a similar 
canary trick, but I don't think that would be necessary for all 
syscalls. Might be easier to instrument copy_to_user() instead if you're
worried about a syscall writing result data that way into a variable on
the stack.

But since we're touching on copy_to_user() here - the 'remove set_fs' 
patch set by Christoph Hellwig refactored the m68k inline helpers around 
July 2021. Can you test a kernel prior to those patches (5.15-rc2)?

That's a lot of work on a 030 Mac - have you tried to reproduce this on
any kind of emulator?
I haven't seen the error in QEMU.

I suppose one difference between your 030 and 040 Macs might be the
amount of RAM available. I wonder if this bug results from a combination
of 030 MMU and memory pressure, or 030 MMU only.
For some reason, the error seems to happen only with 68030 systems,
regardless of processor speed or memory:

PB 170      : 68030, 25 MHz, 8 MiB, external SCSI2SD
Mac IIci    : 68030, 25 MHz, 80 MiB, internal SCSI2SD
SE/30       : 68030, 16 MHz, 128 MiB, external SCSI2SD
PB 550c     : 68040, 33 MHz, 36 MiB, external SCSI2SD
Centris 650 : 68040, 25 MHz, 136 MiB, internal SCSI2SD

Exception handling in copy_to_user() and the related bits in 030 page
fault handling might need another look then...

Cheers,

    Michael

-Stan