Re: reliable reproducer, was Re: core dump analysis

Finn Thain <fthain@xxxxxxxxxxxxxx> · Thu, 20 Apr 2023 12:57:24 +1000 (AEST)

On Thu, 20 Apr 2023, Michael Schmitz wrote:

Can you try and fault in as many of these stack pages as possible, ahead 
of filling the stack? (Depending on how much RAM you have ...). Maybe we 
would need to lock those pages into memory? Just to show that with no 
page faults (but still signals) there is no corruption?
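
A minimal sketch of what pre-faulting and locking the stack could look like from user space. The helper names prefault_stack() and lock_all_memory() are hypothetical and not taken from the test program; sysconf() and mlockall() are the standard POSIX/Linux calls. It assumes the requested size stays below the RLIMIT_STACK limit.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: touch one byte per page in a stack buffer of
 * `bytes` bytes, so the stack page faults are taken here rather than
 * later during the recursion. Returns the number of pages touched. */
size_t prefault_stack(size_t bytes)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    if (bytes == 0)
        return 0;

    volatile char buf[bytes];   /* C99 variable-length array on the stack */

    /* Start at the highest address and work down, which is the
     * direction the stack grows in on m68k. */
    for (size_t off = 0; off < bytes; off += page) {
        buf[bytes - 1 - off] = 0;
        touched++;
    }
    return touched;
}

/* Hypothetical helper: ask the kernel to keep current and future
 * pages resident. Returns 0 on success, -1 with errno set otherwise. */
int lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

Calling prefault_stack() with the expected maximum stack size before the recursion starts, followed by lock_all_memory(), should leave the recursion itself free of stack page faults. mlockall() can fail for an unprivileged process when RLIMIT_MEMLOCK is too small, so its return value needs checking.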

OK.

Any signal frames or exception frames have been completely overwritten 
because the recursion continued after the corruption took place. So 
there's not much to see in the core dump.

We'd need a way to stop recursion once the first corruption has taken 
place. If the 'safe' recursion depth of 10131 is constant, the dump 
taken at that point should look similar to what you saw in dash 
(assuming it is the page fault and subsequent signal return that causes 
the corruption).
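
One possible way to stop at the first corruption, sketched under assumptions: every recursion level leaves a marker in its own frame, and each new level re-checks all earlier markers before descending further. The struct frame_mark layout, pattern_for() and the function names are invented for illustration; this is not how the actual test program detects corruption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-frame marker, chained back to the caller's marker. */
struct frame_mark {
    const volatile struct frame_mark *prev;
    uint32_t depth;
    uint32_t pattern;
};

/* Arbitrary depth-dependent pattern, so a stale or shifted frame
 * does not accidentally match. */
uint32_t pattern_for(uint32_t depth)
{
    return (depth * 2654435761u) ^ 0xA5A5A5A5u;
}

/* Walk the chain from the newest frame to the oldest. Returns the
 * depth of the shallowest damaged marker, or -1 if all are intact. */
long first_bad_frame(const volatile struct frame_mark *m)
{
    long bad = -1;

    for (; m != NULL; m = m->prev)
        if (m->pattern != pattern_for(m->depth))
            bad = (long)m->depth;
    return bad;
}

/* Recurse to `limit`, but return as soon as any marker is damaged.
 * Passing the marker's address down keeps each frame alive, so the
 * compiler cannot turn this into a loop. */
long recurse(uint32_t depth, uint32_t limit,
             const volatile struct frame_mark *prev)
{
    volatile struct frame_mark mark = { prev, depth, pattern_for(depth) };
    long bad = first_bad_frame(&mark);

    if (bad >= 0 || depth >= limit)
        return bad;
    return recurse(depth + 1, limit, &mark);
}
```

A caller could abort() as soon as recurse() returns a non-negative depth, so that the core dump is taken before deeper frames overwrite the evidence. The full walk at every level costs O(depth^2) checks, which is tolerable at depths of a few thousand. A corrupted prev pointer would make the walk itself fault, which also yields an early core dump.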

It turns out that the recursion depth can be set a lot lower than the 
200000 that I chose in that test program. (I used that value as it kept 
the stack size just below the default 8192 kB limit.)

At depth = 2500, a failure is around 95% certain. At depth = 2048 I can 
still get an intermittent failure. This only required 21 stack page faults 
and one fork.
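
As a rough consistency check on these figures, the arithmetic can be sketched as below. The message states neither the frame size nor the page size, so both are assumptions here: the frame size is only bounded from the fact that 200000 levels fit in 8192 kB, and 4096-byte pages are assumed. The function names are hypothetical.

```c
#include <stddef.h>

/* Upper bound on bytes used per recursion level, given that `depth`
 * levels fit within a stack limit of `limit_kb` kilobytes. */
size_t frame_bytes_upper_bound(size_t limit_kb, size_t depth)
{
    return limit_kb * 1024 / depth;
}

/* Number of pages needed to hold `depth` frames of `frame_bytes`
 * bytes each, rounded up to whole pages. */
size_t stack_pages(size_t depth, size_t frame_bytes, size_t page_bytes)
{
    return (depth * frame_bytes + page_bytes - 1) / page_bytes;
}
```

frame_bytes_upper_bound(8192, 200000) gives 41 bytes per level, and stack_pages(2048, 41, 4096) gives 21 pages. That matches the 21 stack page faults reported above, though it is only an estimate and ignores whatever stack was already in use before the recursion began.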

I suspect that the location of the corruption is probably somewhat random, 
and the larger the stack happens to be when the signal comes in, the 
better the odds of detection.