Hi Finn,
Reproduced on my Falcon (with minor mods to the C source - my version of
gcc didn't like asm with no clobbers, so I added "memory" as clobber in
the second asm block). In this case it's a4 that is corrupted, but that
varies.
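For reference, the kind of change I mean looks roughly like the sketch below. This is not the actual test source: the asm body is left empty so it builds on any GCC target, and the names compiler_barrier and store_then_reload are made up for illustration.

```c
/*
 * Empty asm statement with a "memory" clobber. The clobber tells gcc that
 * the asm may read or write memory, so cached values are not carried
 * across it. (Illustrative only; the real test uses m68k instructions.)
 */
static inline void compiler_barrier(void)
{
    __asm__ __volatile__("" : : : "memory");
}

/* Store a value, cross the barrier, then reload it from memory. */
int store_then_reload(int *p, int v)
{
    *p = v;
    compiler_barrier();
    return *p;
}
```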
A depth of 4096 gets me two core dumps in 20 attempts, so this isn't quite
as fast on my Falcon. With 8192, it's nine.
Example:
Core was generated by `./moveml'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0 0x8000060e in rec ()
(gdb) info reg
d0 0x8000057c -2147482244
d1 0xc0017000 -1073647616
d2 0xd1d2d3d4 -774712364
d3 0xe1e2e3e4 -505224220
d4 0xf1f2f3f4 -235736076
d5 0x80096168 -2146868888
d6 0x80093108 -2146881272
d7 0x0 0
a0 0x0 0x0
a1 0xefdadbdc 0xefdadbdc
a2 0x91929394 0x91929394
a3 0xa1a2a3a4 0xa1a2a3a4
a4 0x8000057c 0x8000057c
a5 0xc1c2c3c4 0xc1c2c3c4
fp 0xef87402c 0xef87402c
sp 0xef874010 0xef874010
ps 0x209 521
pc 0x8000060e 0x8000060e <rec+242>
fpcontrol 0x0 0
fpstatus 0x0 0
fpiaddr 0x0 0
(gdb)
On 20.04.2023 at 14:57, Finn Thain wrote:
On Thu, 20 Apr 2023, Michael Schmitz wrote:
Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we
would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?
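A minimal sketch of what pre-faulting and locking could look like, assuming a Linux userland with sysconf() and mlockall(). The helper names prefault_stack and lock_all_pages are hypothetical and not part of the test program.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve `bytes` bytes on the stack and write one byte per page-sized
 * stride, starting nearest the current stack pointer and working towards
 * lower addresses, so the kernel maps those pages now rather than in the
 * middle of the recursion. Returns the number of writes performed.
 */
size_t prefault_stack(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);

    if (bytes == 0 || page <= 0)
        return 0;

    volatile char buf[bytes];   /* variable-length array, lives on the stack */
    size_t touched = 0;
    size_t off = bytes;

    while (off > 0) {
        off = (off > (size_t)page) ? off - (size_t)page : 0;
        buf[off] = 0;           /* volatile write forces the access */
        touched++;
    }
    return touched;
}

/*
 * Ask the kernel to keep current and future pages resident.
 * May fail (returns -1) if RLIMIT_MEMLOCK or privileges do not allow it.
 */
int lock_all_pages(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}
```

The idea would be to call prefault_stack() with roughly the stack size the recursion is expected to need, then lock_all_pages(), and only then start recursing. Whether the locking succeeds on a small-memory machine is a separate question.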
OK.
Any signal frames or exception frames have been completely overwritten
because the recursion continued after the corruption took place. So
there's not much to see in the core dump.
We'd need a way to stop recursion once the first corruption has taken
place. If the 'safe' recursion depth of 10131 is constant, the dump
taken at that point should look similar to what you saw in dash
(assuming it is the page fault and subsequent signal return that causes
the corruption).
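One possible shape for stopping at the first corruption, as a portable sketch only. The real test watches registers; this version watches a marker word kept in each stack frame instead, and the names rec_checked and MARKER are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define MARKER 0xa1a2a3a4u   /* same style of pattern as in the register dump */

/*
 * Recurse `depth` levels. Each frame keeps a marker on its own stack and,
 * before going any deeper, re-checks the marker of the frame above it.
 * Returns -1 if every check passed, otherwise the value of `depth` in the
 * frame that saw the first mismatch. No deeper frames are created after a
 * mismatch, so nothing further overwrites the stack below that point.
 */
int rec_checked(int depth, const volatile uint32_t *parent_marker)
{
    volatile uint32_t marker = MARKER;
    int r;

    if (parent_marker != NULL && *parent_marker != MARKER)
        return depth;            /* first corruption seen: stop here */
    if (depth == 0)
        return -1;

    r = rec_checked(depth - 1, &marker);
    if (r != -1)
        return r;                /* propagate the first failure upwards */

    /* Check our own marker again on the way back up. */
    return (marker != MARKER) ? depth : -1;
}
```

In an actual test run, the early return could be replaced by abort() so that the core is written at that exact point.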
It turns out that the recursion depth can be set a lot lower than the
200000 that I chose in that test program. (I used that value as it kept
the stack size just below the default 8192 kB limit.)
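As a rough cross-check of those numbers: 8192 kB divided by 200000 frames is about 41.9 bytes, so each frame can be at most about 41 bytes. A small sketch for querying the limit and estimating a depth follows; the helper names and the frame size passed in are assumptions, not values taken from the test program.

```c
#include <sys/resource.h>

/* Soft stack limit in kB, or -1 if it is unlimited or cannot be read. */
long stack_limit_kb(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long)(rl.rlim_cur / 1024);
}

/*
 * Largest recursion depth that fits under `limit_kb`, given an assumed
 * per-call frame size in bytes. Returns -1 for nonsensical input.
 */
long max_depth(long limit_kb, long frame_bytes)
{
    if (limit_kb <= 0 || frame_bytes <= 0)
        return -1;
    return limit_kb * 1024 / frame_bytes;
}
```

For example, max_depth(8192, 40) gives 209715, which is in the same range as the 200000 used in the test.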
And it does keep the core a lot smaller. Still not hard to work with on
my 14 MB RAM Falcon...
At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
still get an intermittent failure. This only required 21 stack pagefaults
and one fork.
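One way to count page faults from inside the test, sketched here with getrusage(). This is an assumption about how such a count might be obtained, not necessarily how the figure above was measured, and faults_now is a hypothetical helper name.

```c
#include <sys/resource.h>

/*
 * Total page faults (minor + major) charged to this process so far,
 * or -1 if getrusage() fails.
 */
long faults_now(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}
```

Sampling faults_now() before and after the recursion and subtracting gives the number of faults taken in between. That count includes all faults in the process, not only those on the stack.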
I suspect that the location of the corruption is probably somewhat random,
and the larger the stack happens to be when the signal comes in, the
better the odds of detection.
Yep, but there must be something more to it. Timing of page faults due to
swap bandwidth, perhaps?
Cheers,
Michael