Hi Zhouyi,

On Sat, Apr 22, 2023 at 2:47 PM Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
>
> Dear PowerPC and RCU developers:
> During the RCU torture test on mainline (on the VM of Opensource Lab
> of Oregon State University), SRCU-P failed with __stack_chk_fail:
> [ 264.381952][ T99] [c000000006c7bab0] [c0000000010c67c0]
> dump_stack_lvl+0x94/0xd8 (unreliable)
> [ 264.383786][ T99] [c000000006c7bae0] [c00000000014fc94] panic+0x19c/0x468
> [ 264.385128][ T99] [c000000006c7bb80] [c0000000010fca24]
> __stack_chk_fail+0x24/0x30
> [ 264.386610][ T99] [c000000006c7bbe0] [c0000000002293b4]
> srcu_gp_start_if_needed+0x5c4/0x5d0
> [ 264.388188][ T99] [c000000006c7bc70] [c00000000022f7f4]
> srcu_torture_call+0x34/0x50
> [ 264.389611][ T99] [c000000006c7bc90] [c00000000022b5e8]
> rcu_torture_fwd_prog+0x8c8/0xa60
> [ 264.391439][ T99] [c000000006c7be00] [c00000000018e37c] kthread+0x15c/0x170
> [ 264.392792][ T99] [c000000006c7be50] [c00000000000df94]
> ret_from_kernel_thread+0x5c/0x64
> The kernel config file can be found in [1].
> And I write a bash script to accelerate the bug reproducing [2].
> After a week's debugging, I found the cause of the bug is because the
> register r10 used to judge for stack overflow is not constant between
> context switches.
> The assembly code for srcu_gp_start_if_needed is located at [3]:
> c000000000226eb4: 78 6b aa 7d mr r10,r13
> c000000000226eb8: 14 42 29 7d add r9,r9,r8
> c000000000226ebc: ac 04 00 7c hwsync
> c000000000226ec0: 10 00 7b 3b addi r27,r27,16
> c000000000226ec4: 14 da 29 7d add r9,r9,r27
> c000000000226ec8: a8 48 00 7d ldarx r8,0,r9
> c000000000226ecc: 01 00 08 31 addic r8,r8,1
> c000000000226ed0: ad 49 00 7d stdcx. r8,0,r9
> c000000000226ed4: f4 ff c2 40 bne- c000000000226ec8
> <srcu_gp_start_if_needed+0x1c8>
> c000000000226ed8: 28 00 21 e9 ld r9,40(r1)
> c000000000226edc: 78 0c 4a e9 ld r10,3192(r10)
> c000000000226ee0: 79 52 29 7d xor. r9,r9,r10
> c000000000226ee4: 00 00 40 39 li r10,0
> c000000000226ee8: b8 03 82 40 bne c0000000002272a0
> <srcu_gp_start_if_needed+0x5a0>
> by debugging, I see the r10 is assigned with r13 on c000000000226eb4,
> but if there is a context-switch before c000000000226edc, a false
> positive will be reported.
>
> [1] http://154.220.3.115/logs/0422/configformainline.txt
> [2] 154.220.3.115/logs/0422/whilebash.sh
> [3] http://154.220.3.115/logs/0422/srcu_gp_start_if_needed.txt
>
> My analysis and debugging may not be correct, but the bug is easily
> reproducible.

Could you provide the full kernel log? It is not clear exactly from
your attachments, but I think this is a stack overflow issue as
implied by the mention of __stack_chk_fail. One trick might be to turn
on any available stack debug kernel config options, or check the
kernel logs for any messages related to shortage of remaining stack
space.

Additionally, you could also find out where the kernel crash happened
in C code following the below notes [1] I wrote (see section "Figuring
out where kernel crashes happen in C code"). The notes are
x86-specific but should be generally applicable (In the off chance
you'd like to improve the notes, feel free to share them ;-)).

Lastly, is it a specific kernel release from which you start seeing
this issue? You should try git bisect if it is easily reproducible in
a newer release, but goes away in an older one.

I will also join you in your debug efforts soon though I am currently
in between conferences.

[1] https://gist.github.com/joelagnel/ae15c404facee0eb3ebb8aff0e996a68

thanks,

 - Joel

>
> Thanks
> Zhouyi
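
For anyone following the quoted analysis, here is a minimal Python sketch of how a cached copy of r13 could produce a false canary mismatch. It is a toy model, not kernel code, and it rests on assumptions not confirmed in this thread: that r13 points at a per-CPU area, that the slot at offset 3192 holds the running task's canary and is rewritten when a task is switched in, and that the task resumes on a different CPU. All names (`PerCpuArea`, `Task`, `switch_to`, `run_scenario`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PerCpuArea:
    """Hypothetical stand-in for the per-CPU area that r13 points to."""
    canary: int = 0  # models the slot read by `ld r10,3192(r10)`


@dataclass
class Task:
    name: str
    canary: int  # models the value saved in the stack frame, `ld r9,40(r1)`


def switch_to(cpu: PerCpuArea, task: Task) -> None:
    # Assumption: switching a task in copies its canary into the per-CPU slot.
    cpu.canary = task.canary


def run_scenario(migrate: bool, use_cached_pointer: bool = True) -> bool:
    """Return True if the epilogue canary check passes for task A."""
    cpus = [PerCpuArea(), PerCpuArea()]
    task_a = Task("A", 0x1111)
    task_b = Task("B", 0x2222)

    switch_to(cpus[0], task_a)
    current = cpus[0]   # r13 while A runs on CPU 0
    cached = current    # models `mr r10,r13`

    if migrate:
        switch_to(cpus[0], task_b)  # CPU 0 now runs B
        switch_to(cpus[1], task_a)  # A resumes on CPU 1
        current = cpus[1]           # r13 follows the CPU, r10 does not

    area = cached if use_cached_pointer else current
    # models `xor. r9,r9,r10`: equal values mean the check passes
    return task_a.canary == area.canary


if __name__ == "__main__":
    print("no migration, cached pointer:", run_scenario(False))
    print("migration, cached pointer:   ", run_scenario(True))
    print("migration, fresh r13:        ", run_scenario(True, False))
```

In this model the check only fails when the task moves between CPUs while holding the stale pointer; a switch away from and back to the same CPU restores the slot. Whether the real kernel behaves this way would still need to be confirmed against the full log and the disassembly in [3].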