On 05/12/2016 08:00 AM, Lazuardi Nasution wrote:
Hi,
Suddenly some of our Infernalis OSD nodes went down with a "kernel:BUG:
soft lockup" message. Nothing can be done after that except rebooting.
When I recover by restarting the down OSDs one by one, while also adding
new OSDs, I get the same error again on the same nodes. I'm not sure
whether the "ceph-disk activate" of the recovery, the "ceph-disk prepare"
of the new OSDs, both, or neither is related. Maybe the following trace
can help whoever can understand it. Any thoughts?
Best regards,
[...]
This is the important one (ceph-osd and xfs are your higher/mid-level
layers):
May 12 17:27:20 storage-b kernel: NMI backtrace for cpu 22
May 12 17:27:20 storage-b kernel: CPU: 22 PID: 0 Comm: swapper/22 Tainted: G W OEL ------------ 3.10.0-327.13.1.el7.x86_64 #1
May 12 17:27:20 storage-b kernel: Hardware name: Supermicro SSG-2028R-E1CR24N/X10DRi-T4+, BIOS 1.0b 01/29/2015
May 12 17:27:20 storage-b kernel: task: ffff8820291f4500 ti: ffff881029278000 task.ti: ffff881029278000
May 12 17:27:20 storage-b kernel: RIP: 0010:[<ffffffff8135df87>] [<ffffffff8135df87>] intel_idle+0xd7/0x160
We've seen a number of problems with the idle driver and C-state
entry/exit resulting in things like this. It is not always this specific
issue; in some cases, with specific kernels using xfs backing stores,
we've also seen softirq contexts piling up (even with cpu_idle and other
things removed).
Which kernel is it, and which network cards/drivers are you using? The
BIOS on your SM hardware reports version 1.0b ... is there an update?
(Yeah, it matters.)
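If it helps to collect those answers in one go, here is a rough Python
sketch. It assumes a Linux host with the usual sysfs layout
(/sys/class/dmi/id/bios_version and /sys/class/net/<iface>/device/driver);
the function names are made up for illustration and are not part of any
Ceph tool.

```python
import os
import platform


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.readline().strip()
    except OSError:
        return None


def nic_drivers(sys_root="/sys"):
    """Map each network interface to the name of its kernel driver.

    Interfaces without a device/driver link (lo, bridges, ...) are skipped.
    """
    net_dir = os.path.join(sys_root, "class", "net")
    drivers = {}
    try:
        ifaces = sorted(os.listdir(net_dir))
    except OSError:
        return drivers
    for iface in ifaces:
        link = os.path.join(net_dir, iface, "device", "driver")
        if os.path.islink(link):
            # The link target ends in the driver name, e.g. .../drivers/ixgbe
            drivers[iface] = os.path.basename(os.readlink(link))
    return drivers


def collect_info(sys_root="/sys"):
    """Gather kernel release, BIOS version and NIC drivers (assumed paths)."""
    bios_path = os.path.join(sys_root, "class", "dmi", "id", "bios_version")
    return {
        "kernel": platform.release(),
        "bios_version": read_first_line(bios_path),
        "nic_drivers": nic_drivers(sys_root),
    }


if __name__ == "__main__":
    for key, value in collect_info().items():
        print(key, "=", value)
```

Run it on one of the affected storage nodes and paste the output into
your reply; any value it cannot read is reported as None.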
One thing you can try, though it might not be the complete solution, is
to add this to the kernel boot line:
cpuidle.off=1
It will cause the system not to invoke the idle driver, which hopefully
will stop the messages about a stuck CPU.
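After the reboot it is worth confirming that the option actually made it
onto the boot line. A minimal sketch, assuming Linux with /proc/cmdline
and the standard cpuidle sysfs node
/sys/devices/system/cpu/cpuidle/current_driver; the helper names are
illustrative only. With cpuidle.off=1 I would expect that node to read
"none" or to be missing altogether rather than "intel_idle", but treat
that as an assumption to verify on your kernel.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A bare flag without '=' yields an empty string. If the parameter is
    repeated, this sketch simply keeps the last occurrence.
    """
    value = None
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == name:
            value = val
    return value


def cpuidle_disabled(cmdline):
    """True if the command line carries cpuidle.off=1."""
    return kernel_param(cmdline, "cpuidle.off") == "1"


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline") or ""
    driver = read_text("/sys/devices/system/cpu/cpuidle/current_driver")
    print("cpuidle.off=1 on boot line:", cpuidle_disabled(cmdline))
    print("cpuidle driver:", driver)
```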
May 12 17:27:20 storage-b kernel: Call Trace:
May 12 17:27:20 storage-b kernel: [<ffffffff814d46e0>] cpuidle_enter_state+0x40/0xc0
May 12 17:27:20 storage-b kernel: [<ffffffff814d4839>] cpuidle_idle_call+0xd9/0x210
May 12 17:27:20 storage-b kernel: [<ffffffff8101e4be>] arch_cpu_idle+0xe/0x30
May 12 17:27:20 storage-b kernel: [<ffffffff810d6325>] cpu_startup_entry+0x245/0x290
May 12 17:27:20 storage-b kernel: [<ffffffff810475fa>] start_secondary+0x1ba/0x230
FWIW: we've seen all manner of problems with the 3.10 kernel series
under heavy loads with xfs backing stores. We haven't done a bisection
on it to see where the problem is, as the 3.18 kernel (day job's current
stable kernel) resolves those issues. If you want to try an updated
kernel, that is another avenue.
--
Joe Landman
e: joe.landman@xxxxxxxxx
t: @sijoe
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com