CPU stuck when RAID5 was in recovery

Hello, everyone:
     I created a 16*2T RAID5 array yesterday, just for testing, on kernel 2.6.38; a sketch of the creation command is below.
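
     For reference, this is roughly how the array was created (a minimal sketch; the device
     names /dev/sd[b-q] and the array name /dev/md0 are illustrative, not my exact setup):

         # create a 16-device RAID5 array; the initial sync/recovery starts immediately
         mdadm --create /dev/md0 --level=5 --raid-devices=16 /dev/sd[b-q]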
     While the array was in recovery, dmesg produced a stream of "kernel BUG" messages.
     The output was as follows:

     # BUG: soft lockup - CPU#9 stuck for 67s! [kworker/u:11:1193]
     Modules linked in: raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx bonding [last unloaded: scsi_wait_scan]

     Pid: 1193, comm:         kworker/u:11, CPU: 9
     r0 : 0x0000000000000244 r1 : 0xfffffe41f1506e80 r2 : 0xfffffe41f1436ec0
     r3 : 0xfffffe41f1426ec0 r4 : 0xfffffe41f1416ec0 r5 : 0xfffffe41f1506ea8
     r6 : 0xfffffe41f1506ea0 r7 : 0xfffffe41f1506e98 r8 : 0xfffffe41f1506e90
     r9 : 0xfffffe41f1506e88 r10: 0xfffffe41f1506eb8 r11: 0xfffffe41f1506eb0
     r12: 0xfffffe41f1416eb8 r13: 0xfffffe41f1426eb8 r14: 0x0000000000000000
     BUG: soft lockup - CPU#35 stuck for 67s! [kworker/u:16:1198]
     Modules linked in: raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx bonding [last unloaded: scsi_wait_scan]

      Pid: 1198, comm:         kworker/u:16, CPU: 35
      r0 : 0x000000000000015c r1 : 0xfffffe41f18aa8c0 r2 : 0xfffffe41f183a8c0
      r3 : 0xfffffe41f182a8c0 r4 : 0xfffffe41f181a8c0 r5 : 0xfffffe41f18aa8e8
      r6 : 0xfffffe41f18aa8e0 r7 : 0xfffffe41f18aa8d8 r8 : 0xfffffe41f18aa8d0
      r9 : 0xfffffe41f18aa8c8 r10: 0xfffffe41f18aa8f8 r11: 0xfffffe41f18aa8f0
      r12: 0xfffffe41f183a8c8 r13: 0xfffffe41f181a8d0 r14: 0x483158ac59313149
      r15: 0xfffffe41f183a8d0 r16: 0xfffffe41f182a8d0 r17: 0x0000000000000000
      r18: 0xef05894a64ff6660 r19: 0x0f23dc5939e9b8ba r20: 0xfffffe41f182a8c8
      r21: 0x3bc955b8012dd7f3 r22: 0xe02655135d16deda r23: 0x7cfb03994e2c49d9
      r24: 0x5b7fc5d225715c11 r25: 0x26c199892b1172bb r26: 0xc4d0c69121e38d22
      r27: 0x85c40e017cf99e36 r28: 0x85c40e017cf99e36 r29: 0xfffffe41f181a8c8
      r30: 0x834676b803e35c33 r31: 0xe2115f180af2ff99 r32: 0xfffffe01f63f8c90
      r33: 0x000000000000000f r34: 0x0000000000000000 r35: 0x0000000000000000
      r36: 0x0000000000000000 r37: 0x0000000000000000 r38: 0xffffffffffffffff
      r39: 0xfffffe0000a74348 r40: 0x000000000001f4da r41: 0xfffffe0000a71c80
      r42: 0x00000000007d3680 r43: 0x0000000000002740 r44: 0x0000000000000000
      r45: 0x1000000000000000 r46: 0x0000000000000000 r47: 0x0000000000000000
      r48: 0x0000000000000000 r49: 0x0000000000000000 r50: 0x0000000000000000
      r51: 0x0000000000000000 r52: 0xfffffe00008e3c80 tp : 0x000001f4ff950000
      sp : 0xfffffe01f36efc60 lr : 0x7dbe5c5b0e602eaa
      pc : 0xfffffff710281188 ex1: 1     faultnum: 22

      Starting stack dump of tid 1198, pid 1198 (kworker/u:16) on cpu 35 at cycle 899347866419
      frame 0: 0xfffffff710281188 xor_32regs_p_4.cold+0x80/0x1f0 [xor] (sp 0xfffffe01f36efc60)
      frame 1: 0xfffffff710280e00 xor_blocks.cold+0xc0/0x148 [xor] (sp 0xfffffe01f36efc70)
      frame 2: 0xfffffff7102e0238 async_xor.cold+0x238/0x340 [async_xor] (sp 0xfffffe01f36efc80)
      frame 3: 0xfffffff7102e0408 async_xor_val.cold+0xc8/0x278 [async_xor] (sp 0xfffffe01f36efcd0)
      frame 4: 0xfffffff7103a74d8 __raid_run_ops.cold+0x1180/0x1a78 [raid456] (sp 0xfffffe01f36efd28)
      frame 5: 0xfffffff7103a7e48 async_run_ops+0x78/0xa0 [raid456] (sp 0xfffffe01f36efde8)
      frame 6: 0xfffffff7000b4b90 async_run_entry_fn+0xd8/0x1f8 (sp 0xfffffe01f36efe08)
      frame 7: 0xfffffff7002999e8 process_one_work+0x1e8/0x538 (sp 0xfffffe01f36efe48)
      frame 8: 0xfffffff700274f78 worker_thread+0x378/0x898 (sp 0xfffffe01f36efea0)
      frame 9: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe01f36eff80)
      frame 10: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe01f36effe8)
      Stack dump complete
      hrtimer: interrupt took 26799238 ns
      r15: 0x0000000000000000 r16: 0x4d03b72156442e8b r17: 0x0000000000000000
      r18: 0xe84615a32a3bbb31 r19: 0xfffffe41f1416ea8 r20: 0x0000000000000000
      r21: 0xe19a6fbcb2784276 r22: 0x5073bbff19c23f2b r23: 0x0000000000000000
      r24: 0xa6d8f2e28cc4eb4c r25: 0x0000000000000000 r26: 0x81b7047f7bf7e509
      r27: 0x1f2bb3f9c2f23efe r28: 0x1f2bb3f9c2f23efe r29: 0x4f580806db3001d5
      r30: 0x44dfcd3ece07d7cc r31: 0xac780ca019d138c6 r32: 0xfffffe01f63f5890
      r33: 0x000000000000000f r34: 0x0000000000000000 r35: 0x0000000000000000
      r36: 0x0000000000000000 r37: 0x0000000000000000 r38: 0xffffffffffffffff
      r39: 0xfffffe0000a76a88 r40: 0x000000000001f16e r41: 0xfffffe0000a743c0
      r42: 0x00000000007c5b80 r43: 0x0000000000002740 r44: 0x0000000000000001
      r45: 0x5000000000000000 r46: 0x0000000000000000 r47: 0x0000000000000000
      r48: 0x0000000000000000 r49: 0x0000000000000000 r50: 0x0000000000000000
      r51: 0x0000000000000000 r52: 0xfffffe00008e3c80 tp : 0x000001f4ff7b0000
      sp : 0xfffffe01f373fc60 lr : 0x8b17fa3deee23683
      pc : 0xfffffff710281270 ex1: 1     faultnum: 22

      Starting stack dump of tid 1193, pid 1193 (kworker/u:11) on cpu 9 at cycle 899797964239
      frame 0: 0xfffffff710281270 xor_32regs_p_4.cold+0x168/0x1f0 [xor] (sp 0xfffffe01f373fc60)
      frame 1: 0xfffffff710280e00 xor_blocks.cold+0xc0/0x148 [xor] (sp 0xfffffe01f373fc70)
      frame 2: 0xfffffff7102e0238 async_xor.cold+0x238/0x340 [async_xor] (sp 0xfffffe01f373fc80)
      frame 3: 0xfffffff7102e0408 async_xor_val.cold+0xc8/0x278 [async_xor] (sp 0xfffffe01f373fcd0)
      frame 4: 0xfffffff7103a74d8 __raid_run_ops.cold+0x1180/0x1a78 [raid456] (sp 0xfffffe01f373fd28)
      frame 5: 0xfffffff7103a7e48 async_run_ops+0x78/0xa0 [raid456] (sp 0xfffffe01f373fde8)
      frame 6: 0xfffffff7000b4b90 async_run_entry_fn+0xd8/0x1f8 (sp 0xfffffe01f373fe08)
      frame 7: 0xfffffff7002999e8 process_one_work+0x1e8/0x538 (sp 0xfffffe01f373fe48)
      frame 8: 0xfffffff700274f78 worker_thread+0x378/0x898 (sp 0xfffffe01f373fea0)
      frame 9: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe01f373ff80)
      frame 10: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe01f373ffe8)
      Stack dump complete

      It seems that CPU 9 was occupied by the process "kworker/u:11" and CPU 35 by
      "kworker/u:16". I am using a 36-core Tilera CPU; each core runs at 1.0 GHz. While one of
      the cores was stuck, any process bound to that core got no response at all until the
      lockup was resolved (an example of such a binding is sketched below).
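
      To illustrate the kind of binding I mean (a sketch; the PID 4242 and the core number 3
      are hypothetical):

          # pin an already-running process to core 3; while core 3 is soft-locked,
          # this process cannot be scheduled until the lockup clears
          taskset -pc 3 4242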
      
      Is there any way to lower the priority of processes such as "kworker/u:11"? I would like
      programs bound to a specific core to respond immediately. Can anyone help me?
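
      For context, the only related knob I know of is md's global resync throttle (a sketch;
      the values are arbitrary examples in KB/s, and /dev/md0 is illustrative):

          # lower the ceiling on resync bandwidth, reducing how much CPU time
          # the recovery path can consume
          echo 1000  > /proc/sys/dev/raid/speed_limit_min
          echo 50000 > /proc/sys/dev/raid/speed_limit_max

          # per-array equivalent via sysfs
          echo 50000 > /sys/block/md0/md/sync_speed_max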




