Re: reliable reproducer, was Re: core dump analysis

Hi Finn,

Reproduced on my Falcon, with minor modifications to the C source: my version of gcc didn't like asm with no clobbers, so I added "memory" as a clobber in the second asm block. In this case it's a4 that is corrupted, but that varies.

A depth of 4096 gets me two core dumps in 20 attempts, so this isn't quite as fast on my Falcon. With 8192, it's nine.

Example:

Core was generated by `./moveml'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...done.
Loaded symbols for /lib/ld.so.1
#0  0x8000060e in rec ()
(gdb) info reg
d0             0x8000057c	-2147482244
d1             0xc0017000	-1073647616
d2             0xd1d2d3d4	-774712364
d3             0xe1e2e3e4	-505224220
d4             0xf1f2f3f4	-235736076
d5             0x80096168	-2146868888
d6             0x80093108	-2146881272
d7             0x0	0
a0             0x0	0x0
a1             0xefdadbdc	0xefdadbdc
a2             0x91929394	0x91929394
a3             0xa1a2a3a4	0xa1a2a3a4
a4             0x8000057c	0x8000057c
a5             0xc1c2c3c4	0xc1c2c3c4
fp             0xef87402c	0xef87402c
sp             0xef874010	0xef874010
ps             0x209	521
pc             0x8000060e	0x8000060e <rec+242>
fpcontrol      0x0	0
fpstatus       0x0	0
fpiaddr        0x0	0
(gdb)


On 20.04.2023 at 14:57, Finn Thain wrote:
On Thu, 20 Apr 2023, Michael Schmitz wrote:

Can you try and fault in as many of these stack pages as possible, ahead
of filling the stack? (Depending on how much RAM you have ...). Maybe we
would need to lock those pages into memory? Just to show that with no
page faults (but still signals) there is no corruption?
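A minimal sketch of what pre-faulting and locking could look like, assuming Linux and gcc. The helper names prefault_stack() and lock_all_pages() are hypothetical and not part of the test program; sysconf() and mlockall() are the standard calls. The idea is to grow the stack once and write to every page before the recursion starts, so the recursion itself should not need to fault in new stack pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not part of the original test program.
 * Grows the stack by `bytes` and writes one byte per page, working
 * from the highest address down (the direction the stack grows), so
 * the page faults are taken here rather than during the recursion.
 * Returns the number of page-sized steps touched. */
__attribute__((noinline))
size_t prefault_stack(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    assert(page > 0);
    if (bytes == 0)
        return 0;

    volatile char buf[bytes];   /* VLA: exists only in this frame */

    for (size_t off = bytes; off > 0;
         off -= (off < (size_t)page ? off : (size_t)page)) {
        buf[off - 1] = 1;
        touched++;
    }
    buf[0] = 1;                 /* make sure the lowest page is hit too */
    return touched;
}

/* Hypothetical helper: ask the kernel to keep current and future
 * pages resident. This can fail (for example under RLIMIT_MEMLOCK),
 * so the result is reported and returned, not treated as fatal. */
int lock_all_pages(void)
{
    int rc = mlockall(MCL_CURRENT | MCL_FUTURE);

    if (rc != 0)
        perror("mlockall");
    return rc;
}
```

Calling prefault_stack() with a size somewhat larger than the recursion will need, optionally followed by lock_all_pages(), before the first call to rec() would be one way to run the experiment. Whether locking is practical on a small-memory machine is an open question.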


OK.

Any signal frames or exception frames have been completely overwritten
because the recursion continued after the corruption took place. So
there's not much to see in the core dump.

We'd need a way to stop recursion once the first corruption has taken
place. If the 'safe' recursion depth of 10131 is constant, the dump
taken at that point should look similar to what you saw in dash
(assuming it is the page fault and subsequent signal return that causes
the corruption).
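A rough sketch of stopping at the first detected corruption, in portable C. The assumptions are substantial: the real reproducer checks CPU registers from inline asm, whereas this stand-in uses volatile locals; the constants are copied from values visible in the register dump, and treating them as the test's fill patterns is a guess; the names pattern, first_mismatch() and this version of rec() are illustrative only. The point is only the control flow: check on the way down as well as on the way back up, and abort() immediately so the core is written before deeper frames overwrite the stack.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NPAT 6

/* Assumed fill patterns (values seen intact in the register dump). */
static const uint32_t pattern[NPAT] = {
    0xd1d2d3d4u, 0xe1e2e3e4u, 0xf1f2f3f4u,
    0x91929394u, 0xa1a2a3a4u, 0xc1c2c3c4u,
};

/* Return the index of the first slot that differs from its pattern,
 * or -1 if all slots are intact. */
int first_mismatch(const volatile uint32_t *slot)
{
    assert(slot != NULL);
    for (int i = 0; i < NPAT; i++)
        if (slot[i] != pattern[i])
            return i;
    return -1;
}

/* Dump core at the first sign of corruption instead of recursing on. */
static void check_or_abort(const volatile uint32_t *slot, unsigned depth)
{
    int bad = first_mismatch(slot);

    if (bad >= 0) {
        fprintf(stderr, "slot %d corrupted, %u levels remaining\n",
                bad, depth);
        abort();                /* SIGABRT: core dump taken right here */
    }
}

/* Recurse `depth` more levels, checking this frame's slots before
 * going deeper and again after the deeper call returns. */
void rec(unsigned depth)
{
    volatile uint32_t slot[NPAT];

    for (int i = 0; i < NPAT; i++)
        slot[i] = pattern[i];

    check_or_abort(slot, depth);    /* on the way down */
    if (depth > 0)
        rec(depth - 1);
    check_or_abort(slot, depth);    /* on the way back up */
}
```

On a healthy system rec() simply returns. Stack locals are not a substitute for the register checks in the real test, so this shows the stopping logic only.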


It turns out that the recursion depth can be set a lot lower than the
200000 that I chose in that test program. (I used that value as it kept
the stack size just below the default 8192 kB limit.)

And it does keep the core a lot smaller. Still not hard to work with on my 14 MB RAM Falcon...


At depth = 2500, a failure is around 95% certain. At depth = 2048 I can
still get an intermittent failure. This only required 21 stack pagefaults
and one fork.

I suspect that the location of the corruption is probably somewhat random,
and the larger the stack happens to be when the signal comes in, the
better the odds of detection.

Yep, but there must be more to it than that. Timing of page faults due to swap bandwidth, perhaps?

Cheers,

	Michael




