Hello!
1. https://www.spinics.net/lists/sparclinux/msg25915.html
2. https://www.spinics.net/lists/sparclinux/msg25917.html
I've looked at those and they don't contain the information I am interested in. I believe that stress-ng issues random opcodes in order to test how the system reacts. Those random opcodes are what I need to see, printed out directly by stress-ng before it executes them. The kernel crash traces do not show those, just the aftermath. For instance, in the second trace I can see that the faulting instruction is c2070005 (lduw [ %i4 + %g5 ], %g1), and with i4: 00000000010e11c0 and g5: 794b00a7d5ede977 we can see how that instruction generated an unaligned access. But that is not the instruction executed by stress-ng; it's an instruction in the kernel, operating on faulty data, and I can't tell from the trace where that strange g5 value came from. The actual user instruction that was executed may provide a good hint.
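To spell out the unaligned-access arithmetic, here is a tiny sketch (the helper name is mine, not from any real code):

```c
#include <stdint.h>

/* Effective address of lduw [%i4 + %g5]: lduw is a 32-bit load, so
 * the address must be 4-byte aligned; (ea & 3) != 0 means the access
 * traps as unaligned. */
static uint64_t lduw_effective_address(uint64_t i4, uint64_t g5)
{
    return i4 + g5;
}
```

With the trace values above, the effective address comes out as 794b00a7d6fbfb37, whose low two bits are 0b11, hence the unaligned-access fault.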
I instrumented stress-ng with a simple opcode block logging patch https://pastebin.com/1dZiCzCg but so far the results are hard to make use of :(
1. The amount of code generated at each try is huge - last time it was more than the scrollback buffer of my "screen" could hold.
2. Adding these logging statements makes the bug harder to trigger - I tried on 5.10 and it ran fine multiple times, then failed, but only after many minutes of running before the crash. I was also observing the data over SSH, which might change scheduling/CPU usage too.
Any ideas for better logging that would not be in the way?
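The kind of thing I'm imagining is keeping the opcodes in memory instead of printing them, and only dumping the last few from a crash/signal handler. A minimal sketch (all names hypothetical, not from stress-ng):

```c
#include <stdint.h>
#include <stdio.h>

/* Keep the last N opcodes per worker in an in-memory ring buffer so the
 * hot path is just one store, and dump the buffer only on demand (e.g.
 * from a SIGILL/SIGSEGV handler), keeping output bounded. */
#define OPCODE_RING_SIZE 64

struct opcode_ring {
    uint32_t ops[OPCODE_RING_SIZE];
    uint64_t count;                 /* total opcodes recorded so far */
};

static inline void ring_record(struct opcode_ring *r, uint32_t op)
{
    r->ops[r->count++ % OPCODE_RING_SIZE] = op;
}

/* Dump oldest-to-newest; called from the crash path, not per opcode. */
static void ring_dump(const struct opcode_ring *r, FILE *out)
{
    uint64_t n = r->count < OPCODE_RING_SIZE ? r->count : OPCODE_RING_SIZE;
    uint64_t first = r->count - n;
    for (uint64_t i = 0; i < n; i++)
        fprintf(out, "opcode[%llu] = %08x\n",
                (unsigned long long)(first + i),
                r->ops[(first + i) % OPCODE_RING_SIZE]);
}
```

That would avoid both the scrollback flood and most of the timing perturbation, but maybe there is something better still.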
--
Meelis Roos <mroos@xxxxxxxx>