Re: bisected kernel crash on sparc64 with stress-ng


 



On 2/25/21 12:12 PM, Meelis Roos wrote:
1. If you want to do it just with stress-ng, you could change it so that, instead of generating a random opcode and executing it, you generate a list of (many) random opcodes on your ssh client and send them over to the test machine to be executed. If the system doesn't crash or hang, generate a new list and try again. If it does crash, then do a binary search on the list of opcodes to find the culprit.

Well, it generates many opcodes, but I do not feel like redesigning the stress-ng opcode stressor into a client-server solution, so I would rather go with your kernel modifications.
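
For reference, the list-bisection idea from item 1 could look roughly like the sketch below. It assumes a single culprit that crashes deterministically. `find_culprit` and the `crashes` callback are hypothetical names: `crashes` stands in for "send this batch to the test machine and report whether it crashed or hung", which stress-ng does not provide as-is.

```python
from typing import Callable, Optional, Sequence


def find_culprit(opcodes: Sequence[int],
                 crashes: Callable[[Sequence[int]], bool]) -> Optional[int]:
    """Binary-search a batch of opcodes for the one that crashes the target.

    `crashes(batch)` is a hypothetical callback: it should run the batch on
    the test machine and return True if the machine crashed or hung.
    """
    candidates = list(opcodes)
    if not candidates or not crashes(candidates):
        return None  # nothing to bisect: this batch is harmless
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        if crashes(first_half):
            candidates = first_half
        else:
            # Assumption: exactly one culprit, so it must be in the rest.
            candidates = candidates[len(first_half):]
    return candidates[0]
```

Each round costs one reboot of the test machine in the worst case, so a batch of N opcodes needs about log2(N) crash cycles rather than N.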

2. If that sounds like too much work, we could print the instructions in the kernel when we know we're going to return true. (Sorry the formatting of this will likely be messed up):

Tried it on top of today's git, 5.11.0-09786-g3b9cdafb535:


[   92.724186] fixing up no fault insn c6c310ca
[   94.675033] fixing up no fault insn c8c6d0de
[   94.742247] fixing up no fault insn c8c6d0de


Login incorrect
v240 login:
Password:

Login incorrect
v240 login: [  125.751204] fixing up no fault insn dad750ec

Login timed out
Debian GNU/Linux stretch/sid v240 ttyS0

v240 login: [  128.809516] fixing up no fault insn ea8fd1cb
[  133.757945] fixing up no fault insn fff21079
[  133.819635] fixing up no fault insn fff21079
[  134.605780] fixing up no fault insn e09810de

Debian GNU/Linux stretch/sid v240 ttyS0

v240 login: [  138.514897] fixing up no fault insn cf95d1ef
[  138.571102] fixing up no fault insn cf95d1ef
[  138.627244] fixing up no fault insn cf95d1ef
[  138.683339] fixing up no fault insn cf95d1ef
[  138.739382] fixing up no fault insn cf95d1ef
[  138.795443] fixing up no fault insn cf95d1ef
[  138.851583] fixing up no fault insn cf95d1ef
[  138.907736] fixing up no fault insn cf95d1ef
[  138.963879] fixing up no fault insn cf95d1ef
[  139.020024] fixing up no fault insn cf95d1ef
[  139.076068] fixing up no fault insn cf95d1ef
[  139.132114] fixing up no fault insn cf95d1ef
[  139.188159] fixing up no fault insn cf95d1ef
[  139.244203] fixing up no fault insn cf95d1ef
[  139.300251] fixing up no fault insn cf95d1ef
[  139.356293] fixing up no fault insn cf95d1ef
[  139.412339] fixing up no fault insn cf95d1ef
[  139.468386] fixing up no fault insn cf95d1ef
[  139.524432] fixing up no fault insn cf95d1ef
[  139.580474] fixing up no fault insn cf95d1ef
[  139.636524] fixing up no fault insn cf95d1ef
[  139.692570] fixing up no fault insn cf95d1ef
[  139.748607] fixing up no fault insn cf95d1ef
[  139.804655] fixing up no fault insn cf95d1ef
[  139.860720] fixing up no fault insn cf95d1ef
[  139.860869] Kernel unaligned access at TPC[4add34] cpuacct_charge+0x74/0x80
[  139.916835] Kernel unaligned access at TPC[469db0] irq_enter_rcu+0x10/0x80


OK, this is great data. I think I know what is causing this.




From two boots, the insn varies among
c798d0c9
c8c6d0de
cf95d1ef
d49cd066
dad750ec
e09810de
e3e790c4
e5a051cb
e7f21165
ea8fd1cb
ebb611fc
f4c551de
fe8690fd
fff21079
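
To compare these words by eye, it can help to split them into fields. The sketch below assumes they are all SPARC V9 format-3 (load/store) encodings, which is an assumption about the data rather than something the log states. `Format3` and `decode_format3` are names made up for this example.

```python
from typing import NamedTuple, Optional


class Format3(NamedTuple):
    op: int                 # bits 31:30
    rd: int                 # bits 29:25
    op3: int                # bits 24:19
    rs1: int                # bits 18:14
    i: int                  # bit 13: 0 = rs2 + imm_asi, 1 = simm13 form
    imm_asi: Optional[int]  # bits 12:5, only present when i == 0
    rs2: Optional[int]      # bits 4:0, only present when i == 0


def decode_format3(insn: int) -> Format3:
    """Split a 32-bit instruction word into SPARC V9 format-3 fields."""
    i = (insn >> 13) & 0x1
    return Format3(
        op=(insn >> 30) & 0x3,
        rd=(insn >> 25) & 0x1F,
        op3=(insn >> 19) & 0x3F,
        rs1=(insn >> 14) & 0x1F,
        i=i,
        imm_asi=((insn >> 5) & 0xFF) if i == 0 else None,
        rs2=(insn & 0x1F) if i == 0 else None,
    )


# Three of the words reported above.
for word in (0xC8C6D0DE, 0xCF95D1EF, 0xEBB611FC):
    f = decode_format3(word)
    asi = "n/a" if f.imm_asi is None else f"{f.imm_asi:#04x}"
    print(f"{word:08x}: op={f.op} op3={f.op3:#04x} rd={f.rd} i={f.i} asi={asi}")
```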


Are you saying that in this list of instructions, each one of them causes a crash or hang?




On the last try, "fixing up no fault insn ebb611fc" appeared many times and then the machine hung with nothing more on the serial console. This was the second hang like that.




3. I have a theory that the instruction may be something like this:

         sta %f0, [ %g0 ] #ASI_PNF

which should assemble to 0xc1a01040. You could just try this instruction.
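
The encoding above can be cross-checked by packing the fields by hand. This is a sketch that assumes the SPARC V9 values op3 = 0x34 for STFA and 0x82 for ASI_PNF (ASI_PRIMARY_NOFAULT); `encode_format3_asi` is a hypothetical helper, not part of any assembler.

```python
def encode_format3_asi(rd: int, op3: int, rs1: int, imm_asi: int, rs2: int) -> int:
    """Pack a SPARC V9 format-3 load/store word in the i=0 (imm_asi) form."""
    for name, value, width in (("rd", rd, 5), ("op3", op3, 6), ("rs1", rs1, 5),
                               ("imm_asi", imm_asi, 8), ("rs2", rs2, 5)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    op = 0x3  # load/store instruction group
    return (op << 30) | (rd << 25) | (op3 << 19) | (rs1 << 14) | (imm_asi << 5) | rs2


STFA = 0x34      # assumed op3 for "sta" on a floating-point register
ASI_PNF = 0x82   # assumed value of ASI_PRIMARY_NOFAULT

# sta %f0, [ %g0 ] #ASI_PNF  ->  rd = f0, rs1 = g0, rs2 = g0
print(f"{encode_format3_asi(rd=0, op3=STFA, rs1=0, imm_asi=ASI_PNF, rs2=0):#010x}")
# prints 0xc1a01040
```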

Putting 0xc1a01040 at the start of the opcode sequence makes the test spew this in dmesg 26 times:
fixing up no fault insn c1a01040
and then the kernel hangs.


OK, that means that guess was correct. Almost have all I need...


4. If this does result in a crash, this patch might be the fix:

Yes, with this patch only, it works for multiple minutes and is stable. Nothing in dmesg either.

5. Here is another patch to try after the others:


This resulted in a crash (this is different, irq5 during mm code):

[  304.847868] Unable to handle kernel paging request at virtual address ffffffffffffe000


But what was the last "fixing up no fault insn" message you got before this panic? I need that to be sure that this is just another instance of the other panics and not a different cause.
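
If the serial console output is captured to a file, the last such message before a given event can be pulled out with something like the sketch below. `last_fixup_before` is a made-up helper, and the marker string is whatever text identifies the panic in the capture.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "[  139.860720] fixing up no fault insn cf95d1ef", even when
# the line is prefixed by console noise such as "v240 login: ".
FIXUP_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+fixing up no fault insn ([0-9a-f]{8})")


def last_fixup_before(log: str, marker: str) -> Optional[Tuple[float, int]]:
    """Return (timestamp, insn) of the last fixup message before `marker`.

    If `marker` does not occur, the whole log is searched.
    """
    end = log.find(marker)
    if end == -1:
        end = len(log)
    last = None
    for match in FIXUP_RE.finditer(log, 0, end):
        last = match
    if last is None:
        return None
    return float(last.group(1)), int(last.group(2), 16)


# Two lines taken from the console output earlier in this thread.
sample = (
    "[  139.860720] fixing up no fault insn cf95d1ef\n"
    "[  139.860869] Kernel unaligned access at TPC[4add34] cpuacct_charge+0x74/0x80\n"
)
stamp, insn = last_fixup_before(sample, "Kernel unaligned access")
print(f"{stamp:.6f} {insn:08x}")  # prints 139.860720 cf95d1ef
```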

Also, did you apply this code patch along with others or was it alone? If alone, please try running with all 3 patches applied. My logic leads me to believe that you should not see any panics/hangs with all the code changes applied.

I think the important test cases are c1a01040 (which should be fixed by the first code patch) and cf95d1ef (which should be fixed by the second code patch).



Rob





[  304.952010] tsk->{mm,active_mm}->context = 00000000000009be
[  305.025294] tsk->{mm,active_mm}->pgd = fff0000000db6000
[  305.093913]               \|/ ____ \|/
[  305.093913]               "@'/ .. \`@"
[  305.093913]               /_| \__/ |_\
[  305.093913]                  \__U_/
[  305.287234] stress-ng-opcod(1517): Oops [#1]
[  305.343363] CPU: 1 PID: 1517 Comm: stress-ng-opcod Not tainted 5.11.0-09786-g3b9cdafb535-dirty #294
[  305.462321] TSTATE: 0000004480001603 TPC: 000000000089ad98 TNPC: 000000000089ad9c Y: 00000000    Not tainted
[  305.591565] TPC: <__inet_lookup_established+0x78/0x1e0>
[  305.660186] g0: fff0000000a993c1 g1: 0000000000000000 g2: 2057cf51ce000000 g3: 000000000057cf51
[  305.774569] g4: fff0000000f152c0 g5: fff000133ee8c000 g6: fff000000107c000 g7: 5973ffef02e64d70
[  305.888946] o0: 00000000000065c8 o1: 30222850b2de49fe o2: 0000000000160000 o3: 6857e211521f25c5
[  306.003325] o4: 0000000340f12326 o5: 0000000000a8f400 sp: fff000133fe1ed81 ret_pc: 000000000089ad4c
[  306.122278] RPC: <__inet_lookup_established+0x2c/0x1e0>
[  306.190900] l0: 0000000000000002 l1: 0000000000000000 l2: fff00000006b2e40 l3: 0000000000010000
[  306.305281] l4: 0000000000000001 l5: fff0000000be8980 l6: fff0000000be8840 l7: fff0000000be8840
[  306.419659] i0: 0000000000b30640 i1: 00000000000065c8 i2: 00000000d98965c8 i3: 0000000000000000
[  306.534037] i4: c0a80101c0a8018e i5: 00000000e4230016 i6: fff000133fe1ee31 i7: 00000000008bee58
[  306.648415] I7: <tcp_v4_early_demux+0x98/0x160>
[  306.707887] Call Trace:
[  306.739910] [<00000000008bee58>] tcp_v4_early_demux+0x98/0x160
[  306.816544] [<000000000088f178>] ip_rcv_finish_core.isra.17+0x318/0x420
[  306.903472] [<000000000088f6cc>] ip_list_rcv_finish.isra.19+0x6c/0x140
[  306.989256] [<000000000088fc5c>] ip_list_rcv+0x11c/0x140
[  307.059025] [<0000000000834658>] __netif_receive_skb_list_core+0x138/0x240
[  307.149386] [<0000000000834970>] netif_receive_skb_list_internal+0x210/0x300
[  307.242031] [<0000000000834a68>] gro_normal_list.part.188+0x8/0x40
[  307.323239] [<0000000000835e8c>] napi_complete_done+0x14c/0x1e0
[  307.401015] [<000000001002fc80>] tg3_poll+0x140/0x460 [tg3]
[  307.474326] [<00000000008360a4>] __napi_poll+0x44/0x1a0
[  307.542948] [<00000000008363c4>] net_rx_action+0xc4/0x240
[  307.613861] [<000000000095e170>] __do_softirq+0xd0/0x260
[  307.683633] [<000000000042c86c>] do_softirq_own_stack+0x2c/0x40
[  307.761410] [<0000000000469fa8>] irq_exit+0xc8/0xe0
[  307.825461] [<000000000095de40>] handler_irq+0xc0/0x100
[  307.894087] [<00000000004208b4>] tl0_irq5+0x14/0x20
[  307.958140] Disabling lock debugging due to kernel taint
[  308.027910] Caller[00000000008bee58]: tcp_v4_early_demux+0x98/0x160
[  308.110263] Caller[000000000088f178]: ip_rcv_finish_core.isra.17+0x318/0x420
[  308.202910] Caller[000000000088f6cc]: ip_list_rcv_finish.isra.19+0x6c/0x140
[  308.294411] Caller[000000000088fc5c]: ip_list_rcv+0x11c/0x140
[  308.369898] Caller[0000000000834658]: __netif_receive_skb_list_core+0x138/0x240
[  308.465981] Caller[0000000000834970]: netif_receive_skb_list_internal+0x210/0x300
[  308.564346] Caller[0000000000834a68]: gro_normal_list.part.188+0x8/0x40
[  308.651270] Caller[0000000000835e8c]: napi_complete_done+0x14c/0x1e0
[  308.734766] Caller[000000001002fc80]: tg3_poll+0x140/0x460 [tg3]
[  308.813791] Caller[00000000008360a4]: __napi_poll+0x44/0x1a0
[  308.888134] Caller[00000000008363c4]: net_rx_action+0xc4/0x240
[  308.964769] Caller[000000000095e170]: __do_softirq+0xd0/0x260
[  309.040257] Caller[000000000042c86c]: do_softirq_own_stack+0x2c/0x40
[  309.123754] Caller[0000000000469fa8]: irq_exit+0xc8/0xe0
[  309.193523] Caller[000000000095de40]: handler_irq+0xc0/0x100
[  309.267869] Caller[00000000004208b4]: tl0_irq5+0x14/0x20
[  309.337640] Caller[000000000055e5d0]: __handle_mm_fault+0x190/0xaa0
[  309.419992] Caller[000000000055ef74]: handle_mm_fault+0x94/0x220
[  309.498913] Caller[0000000000451824]: do_sparc64_fault+0x264/0x6e0
[  309.580120] Caller[0000000000407714]: sparc64_realfault_common+0x10/0x20
[  309.668191] Caller[00000000f7b5f298]: 0xf7b5f298
[  309.728811] Instruction DUMP:
[  309.728815]  808ee001
[  309.767698]  32600043
[  309.798579]  b736f001
[  309.829461] <c206ffa0>
[  309.860342]  80a0401a
[  309.891225]  124ffffa
[  309.922107]  01000000
[  309.952988]  c206ffa4
[  309.983871]  80a74001
[  310.014753]
[  310.065080] Kernel panic - not syncing: Aiee, killing interrupt handler!
[  310.153153] ------------[ cut here ]------------
[  310.213767] WARNING: CPU: 1 PID: 1517 at kernel/smp.c:633 smp_call_function_many_cond+0x3bc/0x400
[  310.330439] Modules linked in: loop flash tg3
[  310.387621] CPU: 1 PID: 1517 Comm: stress-ng-opcod Tainted: G      D           5.11.0-09786-g3b9cdafb535-dirty #294
[  310.524881] Call Trace:
[  310.556899] [<0000000000463ea8>] __warn+0x88/0xe0
[  310.618665] [<0000000000463f58>] warn_slowpath_fmt+0x58/0x80
[  310.693010] [<00000000004ef8bc>] smp_call_function_many_cond+0x3bc/0x400
[  310.781083] [<00000000004efb7c>] smp_call_function+0x1c/0x40
[  310.855426] [<0000000000953e2c>] panic+0x11c/0x334
[  310.918333] [<0000000000468ebc>] do_exit+0x8bc/0xbc0
[  310.983529] [<000000000042a854>] die_if_kernel+0x194/0x300
[  311.055587] [<000000000095389c>] unhandled_fault+0x84/0x90
[  311.127646] [<0000000000451a2c>] do_sparc64_fault+0x46c/0x6e0
[  311.203135] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[  311.285488] [<000000000089ad98>] __inet_lookup_established+0x78/0x1e0
[  311.370127] [<00000000008bee58>] tcp_v4_early_demux+0x98/0x160
[  311.446760] [<000000000088f178>] ip_rcv_finish_core.isra.17+0x318/0x420
[  311.533687] [<000000000088f6cc>] ip_list_rcv_finish.isra.19+0x6c/0x140
[  311.619471] [<000000000088fc5c>] ip_list_rcv+0x11c/0x140
[  311.689241] [<0000000000834658>] __netif_receive_skb_list_core+0x138/0x240
[  311.779601] ---[ end trace bb4c0255fe0bffe8 ]---
[  311.840221] ------------[ cut here ]------------
[  311.900838] WARNING: CPU: 1 PID: 1517 at kernel/smp.c:498 smp_call_function_single+0x188/0x1c0
[  312.014078] Modules linked in: loop flash tg3
[  312.071261] CPU: 1 PID: 1517 Comm: stress-ng-opcod Tainted: G      D W         5.11.0-09786-g3b9cdafb535-dirty #294
[  312.208523] Call Trace:
[  312.240539] [<0000000000463ea8>] __warn+0x88/0xe0
[  312.302304] [<0000000000463f58>] warn_slowpath_fmt+0x58/0x80
[  312.376652] [<00000000004ef4c8>] smp_call_function_single+0x188/0x1c0
[  312.461291] [<00000000004efb7c>] smp_call_function+0x1c/0x40
[  312.535637] [<0000000000953e2c>] panic+0x11c/0x334
[  312.598543] [<0000000000468ebc>] do_exit+0x8bc/0xbc0
[  312.663739] [<000000000042a854>] die_if_kernel+0x194/0x300
[  312.735796] [<000000000095389c>] unhandled_fault+0x84/0x90
[  312.807856] [<0000000000451a2c>] do_sparc64_fault+0x46c/0x6e0
[  312.883344] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[  312.965698] [<000000000089ad98>] __inet_lookup_established+0x78/0x1e0
[  313.050337] [<00000000008bee58>] tcp_v4_early_demux+0x98/0x160
[  313.126970] [<000000000088f178>] ip_rcv_finish_core.isra.17+0x318/0x420
[  313.213897] [<000000000088f6cc>] ip_list_rcv_finish.isra.19+0x6c/0x140
[  313.299679] [<000000000088fc5c>] ip_list_rcv+0x11c/0x140
[  313.369450] [<0000000000834658>] __netif_receive_skb_list_core+0x138/0x240
[  313.459809] ---[ end trace bb4c0255fe0bffe9 ]---
[  313.520436] Press Stop-A (L1-A) from sun keyboard or send break
[  313.520436] twice on console to return to the boot prom
[  313.666839] ---[ end Kernel panic - not syncing: Aiee, killing interrupt handler! ]---


Let me know what you find out from all this and I'll try to come up with more ideas.

OK, I can try more things. And thank you for the quick responses!





