Re: Kernel:BUG: Soft Lockup, H/W or S/W Issue?

On 05/12/2016 08:00 AM, Lazuardi Nasution wrote:
Hi,

Suddenly some of our Infernalis OSD nodes went down with a "kernel:BUG: soft lockup" message. Nothing can be done after that except rebooting. When I recover by restarting the down OSDs one by one, while also adding new OSDs, I get the same error again on the same nodes. I'm not sure whether the "ceph-disk activate" of the recovery, the "ceph-disk prepare" of the new OSDs, both, or neither is related. Maybe the following trace can help someone who understands it. Any thoughts?

Best regards,


[...]

this is the important one (ceph-osd and xfs are your higher/mid level layers)

May 12 17:27:20 storage-b kernel: NMI backtrace for cpu 22
May 12 17:27:20 storage-b kernel: CPU: 22 PID: 0 Comm: swapper/22 Tainted: G W OEL ------------ 3.10.0-327.13.1.el7.x86_64 #1
May 12 17:27:20 storage-b kernel: Hardware name: Supermicro SSG-2028R-E1CR24N/X10DRi-T4+, BIOS 1.0b 01/29/2015
May 12 17:27:20 storage-b kernel: task: ffff8820291f4500 ti: ffff881029278000 task.ti: ffff881029278000
May 12 17:27:20 storage-b kernel: RIP: 0010:[<ffffffff8135df87>] [<ffffffff8135df87>] intel_idle+0xd7/0x160

We've seen a number of problems with the idle driver and C-state entry/exit resulting in things like this. That is not always the specific issue: in some cases we've also seen softirq contexts piling up with specific kernels using xfs backing stores (even with cpu_idle and other things removed).

Which kernel is it, and which network cards/drivers are you using? The BIOS on your SM hardware reports 1.0b ... is there an update? (Yeah, it matters.)
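If it helps to gather those answers in one go, here is a rough Python sketch. It assumes the usual Linux sysfs layout (`/sys/class/dmi/id/bios_version` and the `device/driver` symlink under `/sys/class/net/<iface>`); the function name `collect_info` and its `sysfs` parameter are made up for illustration, so treat this as a starting point rather than a finished tool.

```python
import os
import platform
from pathlib import Path


def collect_info(sysfs="/sys"):
    """Collect kernel release, BIOS version and NIC driver names.

    `sysfs` is a hypothetical parameter so the function can be pointed
    at a different root; on a real host the default is what you want.
    """
    root = Path(sysfs)
    bios = root / "class/dmi/id/bios_version"
    info = {
        "kernel": platform.release(),
        "bios_version": bios.read_text().strip() if bios.exists() else None,
        "nic_drivers": {},
    }
    net = root / "class/net"
    if net.is_dir():
        for iface in sorted(net.iterdir()):
            driver = iface / "device" / "driver"
            # Virtual interfaces such as "lo" have no device/driver link.
            if driver.exists():
                target = os.path.realpath(driver)
                info["nic_drivers"][iface.name] = os.path.basename(target)
    return info


if __name__ == "__main__":
    for key, value in collect_info().items():
        print(f"{key}: {value}")
```

Run it on the affected node and paste the output into your reply.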

One thing you can try, though it might not be the complete solution, is to add this to the kernel boot line:

    cpuidle.off=1

It will cause the system not to invoke the idle driver, which hopefully will stop the messages about a stuck CPU.
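After the reboot it is worth confirming that the option actually made it onto the boot line. A minimal sketch in Python, assuming the running kernel's command line is readable from `/proc/cmdline`; the helper name `cpuidle_disabled` is just for illustration:

```python
from pathlib import Path


def cpuidle_disabled(cmdline: str) -> bool:
    """Return True if the boot line carries the exact token cpuidle.off=1."""
    # Boot parameters are whitespace separated, so compare whole tokens
    # rather than doing a substring search.
    return "cpuidle.off=1" in cmdline.split()


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline").read_text()
    print("cpuidle.off=1 present:", cpuidle_disabled(cmdline))
```

This only checks that the parameter was passed; whether the lockups go away is something you will still have to watch for in the logs.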

May 12 17:27:20 storage-b kernel: Call Trace:
May 12 17:27:20 storage-b kernel: [<ffffffff814d46e0>] cpuidle_enter_state+0x40/0xc0
May 12 17:27:20 storage-b kernel: [<ffffffff814d4839>] cpuidle_idle_call+0xd9/0x210
May 12 17:27:20 storage-b kernel: [<ffffffff8101e4be>] arch_cpu_idle+0xe/0x30
May 12 17:27:20 storage-b kernel: [<ffffffff810d6325>] cpu_startup_entry+0x245/0x290
May 12 17:27:20 storage-b kernel: [<ffffffff810475fa>] start_secondary+0x1ba/0x230
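If you end up with many of these traces, pulling out just the function names makes it easier to see whether every stuck CPU is sitting in the idle path. A small sketch in Python, assuming frames have the `[<address>] symbol+0xoffset/0xsize` shape shown above; the regex and the `parse_frames` helper are my own illustration, not part of any kernel tooling:

```python
import re

# One frame looks like: [<ffffffff814d46e0>] cpuidle_enter_state+0x40/0xc0
FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(text):
    """Return (symbol, offset, size) tuples for every frame found in text."""
    return [
        (m.group(2), int(m.group(3), 16), int(m.group(4), 16))
        for m in FRAME.finditer(text)
    ]


if __name__ == "__main__":
    sample = (
        "kernel: [<ffffffff814d46e0>] cpuidle_enter_state+0x40/0xc0 "
        "kernel: [<ffffffff814d4839>] cpuidle_idle_call+0xd9/0x210 "
        "kernel: [<ffffffff8101e4be>] arch_cpu_idle+0xe/0x30"
    )
    for symbol, offset, size in parse_frames(sample):
        print(f"{symbol} (+{offset:#x} of {size:#x})")
```

Because it scans for frames anywhere in the text, it also copes with log lines that have been run together by a mail client.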

FWIW: we've seen all manner of problems with the 3.10 kernel series under heavy loads with xfs backing stores. We haven't done a bisection on it to see where the problem is, as 3.18 (the day job's current stable kernel) resolves those issues. If you want to try an updated kernel, this is another avenue.

--
Joe Landman
e: joe.landman@xxxxxxxxx
t: @sijoe

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


