Re: bisected kernel crash on sparc64 with stress-ng


 



From two boots, the insn varies among
c798d0c9
c8c6d0de
cf95d1ef
d49cd066
dad750ec
e09810de
e3e790c4
e5a051cb
e7f21165
ea8fd1cb
ebb611fc
f4c551de
fe8690fd
fff21079


Are you saying that in this list of instructions, each one of them causes a crash or hang?

No, these just appear in dmesg. Most of them do not seem to cause a crash, because I did not have that many boots. My dmesg capture is unfortunately flaky, with the MikroTik SSH jumphost or Sun ALOM often dropping the SSH console connection.


which should assemble to 0xc1a01040. You could just try this instruction.
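
As a cross-check on that encoding, the word can be split into its fields. This is only a sketch: it assumes the SPARC V9 format-3 layout (op, rd, op3, rs1, i, imm_asi, rs2), and the names `struct fmt3` and `decode_fmt3` are made up for illustration, not taken from the kernel or stress-ng.

```c
#include <stdint.h>

/* Fields of a SPARC V9 format-3 instruction word (assumed layout). */
struct fmt3 {
    unsigned op;      /* bits 31:30 */
    unsigned rd;      /* bits 29:25 */
    unsigned op3;     /* bits 24:19 */
    unsigned rs1;     /* bits 18:14 */
    unsigned i;       /* bit  13    */
    unsigned imm_asi; /* bits 12:5, only meaningful when i == 0 */
    unsigned rs2;     /* bits 4:0,  only meaningful when i == 0 */
};

/* Hypothetical helper: extract the format-3 fields from one opcode word. */
static struct fmt3 decode_fmt3(uint32_t insn)
{
    struct fmt3 f;

    f.op      = (insn >> 30) & 0x3;
    f.rd      = (insn >> 25) & 0x1f;
    f.op3     = (insn >> 19) & 0x3f;
    f.rs1     = (insn >> 14) & 0x1f;
    f.i       = (insn >> 13) & 0x1;
    f.imm_asi = (insn >> 5)  & 0xff;
    f.rs2     = insn & 0x1f;
    return f;
}
```

For 0xc1a01040, `decode_fmt3()` gives op=3, rd=0, op3=0x34, rs1=0, i=0, imm_asi=0x82, rs2=0.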

All tests after that start with this instruction and continue with random ones - I just overwrite the start of the opcode buffer with this.

4. If this does result in a crash, this patch might be the fix:

Yes, with this patch only, it works for multiple minutes and is stable. Nothing in dmesg either.

5. Here is another patch to try after the others:


This resulted in a crash (this is different, irq5 during mm code):

[  304.847868] Unable to handle kernel paging request at virtual address ffffffffffffe000


But what was the last "fixing up no fault insn" message you got before this panic? I need that to be sure that this is just another instance of the other panics and not a different cause.

Did not manage to capture this. Since this was the later kernel, I still have it around and retested - there was (probably) no "fixing up ..." message before the crash.


Also, did you apply this code patch along with others or was it alone? If alone, please try running with all 3 patches applied. My logic leads me to believe that you should not see any panics/hangs with all the code changes applied.

OK, will try them together. So far I tried them one by one and applied some (at least one) by hand. I think that was successful, but it might have failed because I applied it to the wrong state of the code.

I think the important test cases are c1a01040 (which should be fixed by the first code patch) and cf95d1ef (which should be fixed by the second code patch).


Will try the patches incrementally and with both constants for overwriting the start of the opcode block.
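
Roughly what I mean by overwriting the start of the opcode block, as a sketch only. This is not the actual stress-ng code; `put_insn_be` and `fill_opcodes` are names invented here, and `rand()` just stands in for whatever random source the real test uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Store one 32-bit instruction word in big-endian byte order (sparc64 order). */
static void put_insn_be(uint8_t *p, uint32_t insn)
{
    p[0] = (uint8_t)(insn >> 24);
    p[1] = (uint8_t)(insn >> 16);
    p[2] = (uint8_t)(insn >> 8);
    p[3] = (uint8_t)insn;
}

/*
 * Hypothetical helper: fill the buffer with pseudo-random bytes, then
 * force the first instruction to a known word so every run starts with
 * the same opcode and continues with random ones.
 */
static void fill_opcodes(uint8_t *buf, size_t len, uint32_t first_insn,
                         unsigned seed)
{
    size_t i;

    srand(seed);
    for (i = 0; i < len; i++)
        buf[i] = (uint8_t)(rand() & 0xff);

    if (len >= 4)
        put_insn_be(buf, first_insn);
}
```

The two runs would then differ only in the constant, e.g. `fill_opcodes(buf, len, 0xc1a01040u, seed)` versus `fill_opcodes(buf, len, 0xcf95d1efu, seed)`.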

--
Meelis Roos <mroos@xxxxxxxx>


