Re: core dump analysis, was Re: stack smashing detected

Michael Schmitz <schmitzmic@xxxxxxxxx> · Mon, 24 Apr 2023 09:26:12 +1200

Hi Finn,

On 23/04/23 19:46, Finn Thain wrote:
On Wed, 19 Apr 2023, Michael Schmitz wrote:

I wonder what we'd see if we patched the kernel to log every user data
write fault caused by a MOVEM instruction. I'll try to code that up.
If these instructions did always cause stack corruption on 030, I think
we would have noticed long ago?

I think it probably was noticed long ago, in the form of rare userland
crashes on 68030. But it was probably never reported because the actual
culprit is too distant from the symptoms.

But I take your point -- signal delivery seems to be crucial. Would it be
difficult to skip signal delivery following a bus error? Perhaps there's
no need to try that experiment, as we know what would happen.

Shouldn't be too hard, see my other mail.

I will take a look at your modified test program and try to use the output
to figure out the stack gymnastics.

IIUC, there are two RTEs following the page fault. The first one runs the
signal handler, the second one resumes the MOVEM that faulted. Maybe we'll
have to intercept the latter (at do_sigreturn() perhaps?) and examine that
exception frame.

There's no second RTE as far as I can see. Upon return from buserr_c,
the asm buserr handler jumps to ret_from_exception. Since the bus
error was taken from user space, ret_from_exception proceeds to
resume_userspace which, finding the task info flags field non-zero,
jumps to exit_work. There, with a signal pending, the jump to
do_signal_return is taken and the signal handler is set up (frame set
up to return through the sigreturn trampoline, pc set to the handler
etc). No rte anywhere on that path. After setting up for the signal
handler, we return to resume_userspace and, with no further signals
pending, hit RESTORE_ALL, which restores registers from the pt_regs
struct on the kernel stack and has the rte instruction at the end. We
had earlier set usp to the signal frame and pc to the signal handler,
so the handler is what runs once that rte has returned us to user mode.

On exit from the signal handler, sys_sigreturn runs and cleans up the
user stack, then returns to the instruction at the pc from the saved
exception frame that got us into kernel mode in the first place. This
is the moment the moveml instruction resumes.

There should be no difference between ret_from_exception (after buserr)
jumping to RESTORE_ALL directly (with the exception frame still on the
kernel stack from the bus error exception) and doing so after the detour
through signal handler setup, signal handler and sys_sigreturn cleanup.
If the exception frame on the stack was any different from what it ought
to be, rte would fail and raise a format error exception.
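
To make the format check concrete, here is a minimal sketch in C of how
the format/vector word of a 68030 exception frame decodes. This is an
illustration only, not kernel code: the function names
m68030_frame_words() and m68k_vector_number() are made up for this mail.
The top four bits of the word select the frame format, the low twelve
bits hold the vector offset.

```c
#include <stdint.h>

/*
 * Hypothetical helper: size in 16-bit words of a 68030 exception stack
 * frame, derived from the format nibble of the format/vector word.
 * Returns 0 for a format the 68030 does not accept; rte on such a frame
 * takes a format error exception.
 */
int m68030_frame_words(uint16_t format_vector)
{
    switch (format_vector >> 12) {
    case 0x0: return 4;   /* normal four-word frame (e.g. TRAP #n) */
    case 0x1: return 4;   /* throwaway four-word frame */
    case 0x2: return 6;   /* six-word frame (e.g. trace, TRAPV, zero divide) */
    case 0x9: return 10;  /* coprocessor mid-instruction frame */
    case 0xA: return 16;  /* short bus cycle fault frame */
    case 0xB: return 46;  /* long bus cycle fault frame */
    default:  return 0;   /* not a 68030 frame format */
    }
}

/* Hypothetical helper: vector number = vector offset / 4. */
unsigned m68k_vector_number(uint16_t format_vector)
{
    return (unsigned)(format_vector & 0x0fff) >> 2;
}
```

A bus error frame would carry 0xA008 (short) or 0xB008 (long) in that
word, while a TRAP #0 frame carries 0x0080. Both decode to a valid
length, so a format check alone cannot tell whether the frame is the
right one for the fault being completed.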

If the frame was different from that needed to complete the bus error
exception, e.g. one from a trap exception, we'd fail to resume that
moveml instruction and do something else instead. Hmmm - that's an
interesting fault mode... might explain why a3 wasn't saved as it ought
to have been? Can we 'poison' the user stack area that will be used for
register save upon rec() entry with some other patterns to prove that
moveml sometimes does not complete after the bus error?
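
A rough sketch of how such poisoning could be done from the test
program, under a few assumptions: gcc or clang (for the noinline
attribute), an arbitrary pattern and size (POISON_PATTERN and
POISON_WORDS), and helper names that are hypothetical. rec() is the test
program's own function and is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

#define POISON_WORDS   1024u        /* arbitrary: 4 KiB below the caller's sp */
#define POISON_PATTERN 0xDEADBEEFu  /* arbitrary, unlikely register content */

/* Fill n 32-bit words with the pattern. The volatile qualifier keeps
 * the compiler from dropping stores to memory that looks dead. */
void poison_fill(volatile uint32_t *buf, size_t n, uint32_t pattern)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = pattern;
}

/* Count how many of the n words still hold the pattern. */
size_t poison_count_intact(const volatile uint32_t *buf, size_t n,
                           uint32_t pattern)
{
    size_t intact = 0;

    for (size_t i = 0; i < n; i++)
        if (buf[i] == pattern)
            intact++;
    return intact;
}

/* Poison the stack area just below the caller's stack pointer. The
 * scratch array lives in this function's own frame, which is the same
 * memory the caller's next callee will use for its register save. */
__attribute__((noinline)) void poison_stack_below(void)
{
    volatile uint32_t scratch[POISON_WORDS];

    poison_fill(scratch, POISON_WORDS, POISON_PATTERN);
}
```

The idea would be to call poison_stack_below() immediately before the
call to rec(). A saved-register slot in the core dump that still holds
POISON_PATTERN was then never written by the moveml. One caveat: a
signal delivered between the two calls puts its frame on the same part
of the user stack and overwrites the pattern.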

Cheers,

    Michael