On Sun, Apr 23, 2023 at 3:19 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Zhouyi,
Thank you, Joel, for your quick response ;-)
I will gradually provide all the information needed to help us chase this down.
Please do not hesitate to email me if I have missed anything ;-)
>
> On Sat, Apr 22, 2023 at 2:47 PM Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
> >
> > Dear PowerPC and RCU developers:
> > During the RCU torture test on mainline (on the VM of Opensource Lab
> > of Oregon State University), SRCU-P failed with __stack_chk_fail:
> > [ 264.381952][ T99] [c000000006c7bab0] [c0000000010c67c0]
> > dump_stack_lvl+0x94/0xd8 (unreliable)
> > [ 264.383786][ T99] [c000000006c7bae0] [c00000000014fc94] panic+0x19c/0x468
> > [ 264.385128][ T99] [c000000006c7bb80] [c0000000010fca24]
> > __stack_chk_fail+0x24/0x30
> > [ 264.386610][ T99] [c000000006c7bbe0] [c0000000002293b4]
> > srcu_gp_start_if_needed+0x5c4/0x5d0
> > [ 264.388188][ T99] [c000000006c7bc70] [c00000000022f7f4]
> > srcu_torture_call+0x34/0x50
> > [ 264.389611][ T99] [c000000006c7bc90] [c00000000022b5e8]
> > rcu_torture_fwd_prog+0x8c8/0xa60
> > [ 264.391439][ T99] [c000000006c7be00] [c00000000018e37c] kthread+0x15c/0x170
> > [ 264.392792][ T99] [c000000006c7be50] [c00000000000df94]
> > ret_from_kernel_thread+0x5c/0x64
> > The kernel config file can be found in [1].
> > And I write a bash script to accelerate the bug reproducing [2].
> > After a week's debugging, I found the cause of the bug is because the
> > register r10 used to judge for stack overflow is not constant between
> > context switches.
> > The assembly code for srcu_gp_start_if_needed is located at [3]:
> > c000000000226eb4: 78 6b aa 7d mr r10,r13
> > c000000000226eb8: 14 42 29 7d add r9,r9,r8
> > c000000000226ebc: ac 04 00 7c hwsync
> > c000000000226ec0: 10 00 7b 3b addi r27,r27,16
> > c000000000226ec4: 14 da 29 7d add r9,r9,r27
> > c000000000226ec8: a8 48 00 7d ldarx r8,0,r9
> > c000000000226ecc: 01 00 08 31 addic r8,r8,1
> > c000000000226ed0: ad 49 00 7d stdcx. r8,0,r9
> > c000000000226ed4: f4 ff c2 40 bne- c000000000226ec8
> > <srcu_gp_start_if_needed+0x1c8>
> > c000000000226ed8: 28 00 21 e9 ld r9,40(r1)
> > c000000000226edc: 78 0c 4a e9 ld r10,3192(r10)
> > c000000000226ee0: 79 52 29 7d xor. r9,r9,r10
> > c000000000226ee4: 00 00 40 39 li r10,0
> > c000000000226ee8: b8 03 82 40 bne c0000000002272a0
> > <srcu_gp_start_if_needed+0x5a0>
> > by debugging, I see the r10 is assigned with r13 on c000000000226eb4,
> > but if there is a context-switch before c000000000226edc, a false
> > positive will be reported.
> >
> > [1] http://154.220.3.115/logs/0422/configformainline.txt
> > [2] 154.220.3.115/logs/0422/whilebash.sh
> > [3] http://154.220.3.115/logs/0422/srcu_gp_start_if_needed.txt
> >
> > My analysis and debugging may not be correct, but the bug is easily
> > reproducible.
>
> Could you provide the full kernel log? It is not clear exactly from
> your attachments, but I think this is a stack overflow issue as
> implied by the mention of __stack_chk_fail. One trick might be to turn
> on any available stack debug kernel config options, or check the
> kernel logs for any messages related to shortage of remaining stack
> space.
The full kernel log is [1]
[1] http://154.220.3.115/logs/0422/console.log
>
> Additionally, you could also find out where the kernel crash happened
> in C code following the below notes [1] I wrote (see section "Figuring
> out where kernel crashes happen in C code").
> The notes are
> x86-specific but should be generally applicable (In the off chance
> you'd like to improve the notes, feel free to share them ;-)).
Fantastic article! I benefited a lot from reading it.
Because we can reproduce the bug so easily on the powerpc VM, I can even use gdb to debug it.
My debugging session on 2e83b879fb91dafe995967b46a1d38a5b0889242
(srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()) is recorded in [2].
[2] http://154.220.3.115/logs/0422/gdb.txt
>
> Lastly, is it a specific kernel release from which you start seeing
> this issue? You should try git bisect if it is easily reproducible in
> a newer release, but goes away in an older one.
I did a bisect on the powerpc VM. The problem begins to appear at
2e83b879fb91dafe995967b46a1d38a5b0889242
(srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()).
The kernel is good at 5d0f5953b60f5f7a278085b55ddc73e2932f4c33
(srcu: Convert ->srcu_lock_count and ->srcu_unlock_count to atomic).
But if I apply the following patch [3] to 5d0f5953b60f5f7a278085b55ddc73e2932f4c33,
the bug appears again.
[3] http://154.220.3.115/logs/0422/bug.patch
Both the native gcc on the PPC VM (gcc version 9.4.0) and the gcc cross compiler
on my x86 laptop (gcc version 10.4.0) reproduce the bug.
>
> I will also join you in your debug efforts soon though I am currently
> in between conferences.
Exciting!! Thank you very much!
I can give you ssh access (based on an RSA public key) to the PPC VM at
Oregon State University if you like.

Thanks again
Zhouyi
>
> [1] https://gist.github.com/joelagnel/ae15c404facee0eb3ebb8aff0e996a68
>
> thanks,
>
> - Joel
>
>
> >
> >
> > Thanks
> > Zhouyi
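
P.S. To make the suspected race easier to discuss, here is a small user-space model of it. This is only a sketch of my reading of the disassembly, not kernel code. It assumes that r13 points at a per-CPU area, that the word at offset 3192 in that area holds the canary of the task currently running on that CPU, and that a context switch rewrites that word. The names Cpu, Task, switch_to and run are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Stand-in for the per-CPU area reached through r13 (assumption)."""
    canary: int = 0


@dataclass
class Task:
    name: str
    canary: int


def switch_to(cpu: Cpu, task: Task) -> None:
    """Model of a context switch: publish the incoming task's canary."""
    cpu.canary = task.canary


def run(migrate: bool, reload_r13: bool = False) -> bool:
    """Return True if the modelled stack check passes."""
    cpu0, cpu1 = Cpu(), Cpu()
    task_a = Task("A", 0x1111)
    task_b = Task("B", 0x2222)

    switch_to(cpu0, task_a)
    current = cpu0            # what r13 would point at right now
    saved = current.canary    # prologue: canary copied into the stack slot
    r10 = current             # "mr r10,r13": pointer cached early

    if migrate:
        switch_to(cpu0, task_b)   # another task now runs on the old CPU
        switch_to(cpu1, task_a)   # task A resumes on a different CPU
        current = cpu1            # r13 follows the task, r10 does not

    base = current if reload_r13 else r10
    # "ld r9,40(r1); ld r10,3192(r10); xor. r9,r9,r10"
    return (saved ^ base.canary) == 0


if __name__ == "__main__":
    print("no migration, cached r10:", run(migrate=False))
    print("migration, cached r10:   ", run(migrate=True))
    print("migration, fresh r13:    ", run(migrate=True, reload_r13=True))
```

In this model the check only fails when the task migrates while the stale copy in r10 is still used for the load. If the real kernel behaves the same way, the stack itself is intact and the report is a false positive.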