On Thu, 20 Apr 2023, Michael Schmitz wrote:
> Can you try and fault in as many of these stack pages as possible, ahead
> of filling the stack? (Depending on how much RAM you have ...). Maybe we
> would need to lock those pages into memory? Just to show that with no
> page faults (but still signals) there is no corruption?
OK.
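For reference, a minimal sketch of what faulting in and locking the stack pages could look like. This is not the code from the test program: the function names and the 512 kB PREFAULT_BYTES figure are made up for illustration and would have to be sized to cover the deepest recursion. It only relies on sysconf() and mlockall().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical size: must cover the stack the recursion will use. */
enum { PREFAULT_BYTES = 512 * 1024 };

/* Write one byte into every page of a large local buffer so that the
 * kernel maps those stack pages now, not during the recursion.
 * Returns the number of pages touched. */
size_t prefault_stack(void)
{
    volatile char buf[PREFAULT_BYTES];
    long page = sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    if (page <= 0)
        page = 4096;            /* fallback guess if sysconf() fails */

    for (size_t off = 0; off < PREFAULT_BYTES; off += (size_t)page) {
        buf[off] = 0;           /* volatile store forces a real access */
        touched++;
    }
    assert(touched > 0);
    return touched;
}

/* Fault the stack in first, then ask the kernel to keep everything that
 * is mapped now (and later) resident. Returns 0 on success, -1 on error
 * (mlockall() can fail, e.g. because of RLIMIT_MEMLOCK). */
int prefault_and_lock(void)
{
    prefault_stack();
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

Calling prefault_and_lock() once before starting the recursion (and before installing the signal handler) should mean the recursion itself takes no stack page faults, provided the buffer is at least as large as the stack the recursion consumes.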
>> Any signal frames or exception frames have been completely overwritten
>> because the recursion continued after the corruption took place. So
>> there's not much to see in the core dump.
>
> We'd need a way to stop recursion once the first corruption has taken
> place. If the 'safe' recursion depth of 10131 is constant, the dump
> taken at that point should look similar to what you saw in dash
> (assuming it is the page fault and subsequent signal return that causes
> the corruption).
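One way to stop at the first corruption, sketched under assumptions: every level of the recursion keeps a small guard record on its own stack frame, and each new level re-checks all the guards above it before going deeper. The struct, GUARD_MAGIC and both function names are hypothetical, and the extra record changes the per-level frame size compared with the original test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_MAGIC 0x5AFE5AFEu   /* arbitrary pattern, an assumption */

/* One guard per stack frame, chained to the caller's guard. The fields
 * are volatile so the checks really re-read stack memory. */
struct frame_guard {
    volatile uint32_t magic;
    volatile long depth;
    const struct frame_guard *parent;
};

/* Walk from the innermost guard outwards. Returns the depth of the
 * first guard that no longer holds what was stored, or -1 if all are
 * intact. The walk stops at a damaged guard because its parent pointer
 * cannot be trusted either. */
long find_damage(const struct frame_guard *g, long depth)
{
    for (; g != NULL; g = g->parent, depth--) {
        if (g->magic != GUARD_MAGIC || g->depth != depth)
            return depth;
    }
    return -1;
}

/* Recurse to 'limit', verifying every outer frame at each level.
 * Returns -1 if no damage was seen, else the depth of the bad guard. */
long recurse(long depth, long limit, const struct frame_guard *parent)
{
    struct frame_guard g = { GUARD_MAGIC, depth, parent };
    long bad = find_damage(&g, depth);

    if (bad >= 0)
        return bad;     /* an abort() here would dump core right away */
    if (depth == limit)
        return -1;
    return recurse(depth + 1, limit, &g);
}
```

The full walk at every level costs O(depth^2) checks, which is cheap at a depth of a few thousand. Replacing the `return bad;` with abort() would produce the core dump while the damaged frames are still live, rather than after further recursion has overwritten them.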
It turns out that the recursion depth can be set a lot lower than the
200000 that I chose in that test program. (I used that value as it kept
the stack size just below the default 8192 kB limit.)
At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
still get an intermittent failure. This only required 21 stack page faults
and one fork.
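For anyone wanting to reproduce the page fault count, here is a rough sketch of one way to measure it from user space with getrusage(). It is not necessarily how the figure above was obtained; burn_stack(), its 32-byte pad and the other names are stand-ins for illustration, and the count covers all faults taken by the process in that interval, not just stack faults.

```c
#include <assert.h>
#include <sys/resource.h>

/* Total page faults (minor + major) taken by this process so far,
 * or -1 if getrusage() fails. */
long fault_count(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/* Stand-in for the test program's recursion: each level touches a
 * small local buffer so the frame really occupies stack. Returns the
 * number of levels entered below the first call. */
long burn_stack(long depth)
{
    volatile char pad[32];      /* hypothetical frame padding */

    pad[0] = 1;                 /* volatile store keeps the buffer alive */
    if (depth == 0)
        return 0;
    return burn_stack(depth - 1) + pad[0];   /* not a tail call */
}

/* Faults taken while recursing to 'depth', or -1 on error. */
long faults_during_recursion(long depth)
{
    long before = fault_count();
    long levels = burn_stack(depth);
    long after = fault_count();

    assert(levels == depth);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

A second call to faults_during_recursion() with the same depth would normally report far fewer faults, since the stack pages are already mapped. As a sanity check on the numbers, 8192 kB spread over 200000 levels puts an upper bound of about 42 bytes on the stack used per level.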
I suspect that the location of the corruption is somewhat random,
and the larger the stack happens to be when the signal comes in, the
better the odds of detection.