Re: reliable reproducer, was Re: core dump analysis

Finn Thain <fthain@xxxxxxxxxxxxxx> · Thu, 20 Apr 2023 15:17:08 +1000 (AEST)

On Thu, 20 Apr 2023, Michael Schmitz wrote:

As with dash, the corruption lies at the page boundary.

That implies a page fault was handled at the page boundary.

Can you try and fault in as many of these stack pages as possible, ahead 
of filling the stack? (Depending on how much RAM you have ...). Maybe we 
would need to lock those pages into memory? Just to show that with no 
page faults (but still signals) there is no corruption?
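
For illustration only, here is one way the "fault in, then lock" idea
could be sketched in C. prefault_stack() is a hypothetical helper, not
part of the actual test program; it assumes a stack limit comfortably
larger than the requested size, and mlockall() may be refused without
sufficient privileges, in which case the pages are still faulted in but
not pinned.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: touch one byte in every page of a `bytes`-sized
 * stack buffer so the kernel maps those pages now, then try to lock
 * everything currently mapped. Returns the number of pages touched.
 */
size_t prefault_stack(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t step = page > 0 ? (size_t)page : 4096;
    size_t touched = 0;

    if (bytes == 0)
        return 0;

    volatile char buf[bytes];   /* variable-length array on the stack */

    /* Walk from the high end down, the direction the stack grows. */
    for (size_t off = bytes; off > 0; off = off > step ? off - step : 0) {
        buf[off - 1] = 0;
        touched++;
    }

    /* Pin what is mapped so far. This can fail (e.g. RLIMIT_MEMLOCK);
     * report it and carry on, since the pages are faulted in anyway. */
    if (mlockall(MCL_CURRENT) != 0)
        perror("mlockall");

    return touched;
}
```

Calling prefault_stack() once at startup, with a size at least as large
as the deepest recursion is expected to need, would mean the later
descent takes no stack page faults while signals still arrive.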

I modified the test program to execute rec() to full depth with no 
forking, then do it again with forking.

root@(none):/root# while ./stack-test 5000 ; do : ; done
starting recursion
done.
starting recursion with fork
done.
starting recursion
done.
starting recursion with fork
Illegal instruction
root@(none):/root# 

I can't get this to crash during the first descent. The second descent 
always crashes, given sufficient depth:

root@(none):/root# while ./stack-test 50000 ; do : ; done
starting recursion
done.
starting recursion with fork
Illegal instruction

So all the stack pages would have been faulted in well before the failure 
shows up. It appears to be the signal that's the problem and not the page 
fault. That's not surprising considering the PC in the signal frame in the 
dash crash was a MOVEM saving registers onto the stack.

It's worth noting that the test program never crashes with a corrupted 
return address. Random corruption would have clobbered that address about 
11% of the time (1 word in 9), since the entire rec() stack frame is 9 long 
words. So it must be that a MOVEM went awry when a signal got delivered.
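
To put a number on that argument: if each crash clobbered one long word
chosen uniformly from the 9-word frame, the chance of missing the return
address every time shrinks quickly with the number of crashes. The
function below is only a sketch of that arithmetic. Uniform, independent
corruption is an assumption, and the number of crashes actually observed
is not given here.

```c
/*
 * Chance that `crashes` independent, uniformly placed one-word
 * corruptions all miss the return address in a stack frame of
 * `frame_words` long words (one of which is the return address).
 */
double p_all_miss(int frame_words, int crashes)
{
    double miss_once = (double)(frame_words - 1) / frame_words;
    double p = 1.0;

    for (int i = 0; i < crashes; i++)
        p *= miss_once;
    return p;
}
```

With 9 words per frame a single crash misses the return address about
89% of the time, but twenty crashes in a row would all miss it a little
under 10% of the time.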