Re: bisected kernel crash on sparc64 with stress-ng

On 2/22/21 12:34 PM, Meelis Roos wrote:
Hello!

1. https://www.spinics.net/lists/sparclinux/msg25915.html
2. https://www.spinics.net/lists/sparclinux/msg25917.html

I've looked at those and they don't contain the information I am interested in. I believe that stress-ng issues random opcodes in order to test how the system reacts. The actual random opcodes are what I need to see printed out directly from stress-ng before it actually executes the opcode. The kernel crash traces do not show those, just the aftermath. For instance, in the second trace I can see that the faulting instruction is c2070005 (lduw [ %i4 + %g5 ], %g1) and with i4: 00000000010e11c0 and g5: 794b00a7d5ede977, we can see how that instruction generated an unaligned access. But that is not the instruction executed by stress-ng, it's an instruction in the kernel, operating on faulty data, and I can't tell from the trace where that strange g5 value came from. The actual user instruction that was executed may provide a good hint.


I instrumented stress-ng with a simple opcode-block logging patch (https://pastebin.com/1dZiCzCg), but so far the results are hard to use :(

1. The amount of code generated at each try is huge - last time it was more than the scrollback buffer of my "screen".

2. Adding these logging statements makes the bug harder to trigger: on 5.10 it ran fine several times and then failed, but only after many minutes of running. I was also watching the output over SSH, which might change scheduling and CPU usage.

Any ideas for better logging that would not be in the way?


Here are a few things to try:

1. If you want to do it just with stress-ng, you could change it so that, instead of generating a random opcode and executing it immediately, you generate a list of (many) random opcodes on your ssh client and send them over to the test machine to be executed. If the system doesn't crash or hang, generate a new list and try again. If it does crash, do a binary search on the list of opcodes to find the culprit.

2. If that sounds like too much work, we could print the instructions in the kernel when we know we're going to return true. (Sorry the formatting of this will likely be messed up):

diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 27778b65a965..77e31d7c4097 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -277,11 +277,13 @@ bool is_no_fault_exception(struct pt_regs *regs)
                        asi = (insn >> 5);          /* immediate asi    */
                if ((asi & 0xf2) == ASI_PNF) {
                        if (insn & 0x1000000) {     /* op3[5:4]=3       */
+                               printk(KERN_ALERT "fixing up no fault insn %x\n", insn);
                                handle_ldf_stq(insn, regs);
                                return true;
                        } else if (insn & 0x200000) { /* op3[2], stores */
                                return false;
                        }
+                       printk(KERN_ALERT "fixing up no fault insn %x\n", insn);
                        handle_ld_nf(insn, regs);
                        return true;
                }

3. I have a theory that the instruction may be something like this:

        sta %f0, [ %g0 ] #ASI_PNF

which should assemble to 0xc1a01040. You could just try this instruction.

4. If this does result in a crash, this patch might be the fix:

diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index 77e31d7c4097..c0d2e3665e69 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -276,12 +276,12 @@ bool is_no_fault_exception(struct pt_regs *regs)
                else
                        asi = (insn >> 5);          /* immediate asi    */
                if ((asi & 0xf2) == ASI_PNF) {
+                       if (insn & 0x200000)  /* op3[2], stores */
+                               return false;
                        if (insn & 0x1000000) {     /* op3[5:4]=3       */
                                printk(KERN_ALERT "fixing up no fault insn %x\n", insn);
                                handle_ldf_stq(insn, regs);
                                return true;
-                       } else if (insn & 0x200000) { /* op3[2], stores */
-                               return false;
                        }
                        printk(KERN_ALERT "fixing up no fault insn %x\n", insn);
                        handle_ld_nf(insn, regs);

5. Try the patch in #4 regardless of the outcome of step #3.

6. Here is another patch to try after the others:

diff --git a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c
index c0d2e3665e69..e383738fdd9f 100644
--- a/arch/sparc/kernel/traps_64.c
+++ b/arch/sparc/kernel/traps_64.c
@@ -275,7 +275,7 @@ bool is_no_fault_exception(struct pt_regs *regs)
                        asi = (regs->tstate >> 24); /* saved %asi       */
                else
                        asi = (insn >> 5);          /* immediate asi    */
-               if ((asi & 0xf2) == ASI_PNF) {
+               if (asi == ASI_PNF) {
                        if (insn & 0x200000)  /* op3[2], stores */
                                return false;
                        if (insn & 0x1000000) {     /* op3[5:4]=3       */


Let me know what you find out from all this and I'll try to come up with more ideas.


Rob





