Hello Cephers,

lately our Ceph cluster has started to show some weird behavior: the OSD boxes show a load of 5000-15000 before the OSDs get marked down. Usually the box itself is still fully usable, even "apt-get dist-upgrade" runs smoothly and you can read and write to any disk; the only things you can't do are strace the osd processes, sync, or reboot. The only related log entries we find are hung_task warnings about xfsaild (the XFS AIL, i.e. Active Item List, daemon):

Dec 7 15:36:32 ceph1-store204 kernel: [152066.016108] [<ffffffff81093790>] ? kthread_create_on_node+0x1c0/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016112] INFO: task xfsaild/dm-1:1445 blocked for more than 120 seconds.
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016329]       Tainted: G         C    3.19.0-39-generic #44~14.04.1-Ubuntu
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016802] xfsaild/dm-1    D ffff8807faa03af8     0  1445      2 0x00000000
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016805]  ffff8807faa03af8 ffff8808098989d0 0000000000013e80 ffff8807faa03fd8
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016808]  0000000000013e80 ffff88080bb775c0 ffff8808098989d0 ffff88011381b2a8
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016812]  ffff8807faa03c50 7fffffffffffffff ffff8807faa03c48 ffff8808098989d0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016815] Call Trace:
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016819]  [<ffffffff817b2fd9>] schedule+0x29/0x70
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016823]  [<ffffffff817b609c>] schedule_timeout+0x20c/0x280
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016826]  [<ffffffff810a40a5>] ? sched_clock_cpu+0x85/0xc0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016830]  [<ffffffff810a0911>] ? try_to_wake_up+0x1f1/0x340
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016834]  [<ffffffff817b3d04>] wait_for_completion+0xa4/0x170
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016836]  [<ffffffff810a0ad0>] ? wake_up_state+0x20/0x20
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016840]  [<ffffffff8108e86d>] flush_work+0xed/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016846]  [<ffffffff8108acc0>] ? destroy_worker+0x90/0x90
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016870]  [<ffffffffc06f556e>] xlog_cil_force_lsn+0x7e/0x1f0 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016873]  [<ffffffff810daddb>] ? lock_timer_base.isra.36+0x2b/0x50
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016878]  [<ffffffff810dbdcf>] ? try_to_del_timer_sync+0x4f/0x70
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016901]  [<ffffffffc06f3980>] _xfs_log_force+0x60/0x270 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016904]  [<ffffffff810daba0>] ? internal_add_timer+0x80/0x80
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016926]  [<ffffffffc06f3bba>] xfs_log_force+0x2a/0x90 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016948]  [<ffffffffc06fe340>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016970]  [<ffffffffc06fe480>] xfsaild+0x140/0x5a0 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016992]  [<ffffffffc06fe340>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016996]  [<ffffffff81093862>] kthread+0xd2/0xf0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017000]  [<ffffffff81093790>] ? kthread_create_on_node+0x1c0/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017005]  [<ffffffff817b72d8>] ret_from_fork+0x58/0x90
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017009]  [<ffffffff81093790>] ? kthread_create_on_node+0x1c0/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017013] INFO: task xfsaild/dm-6:1616 blocked for more than 120 seconds.

kswapd is also reported as hung, but we don't have swap on the OSD nodes.
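To be clear about what that load figure means: it seems to be tasks stuck in uninterruptible sleep (D state), which count towards the load average without using any CPU. A rough, untested sketch of how we'd confirm that on an affected box (plain ps/awk plus sysrq, nothing Ceph-specific):

# list and count tasks in uninterruptible sleep (D state); these are
# what push the load average into the thousands without any CPU use
ps -eo state,pid,comm | awk 'BEGIN { n = 0 } $1 ~ /^D/ { print; n++ } END { print n, "tasks in D state" }'

# dump kernel stacks of all blocked tasks to dmesg on demand (needs
# root and kernel.sysrq enabled; same info as the hung_task warnings)
echo w > /proc/sysrq-trigger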
It looks like either all the ceph-osd threads are reporting in as willing to work at once, or it's the XFS maintenance process itself, as described in [1,2].

Usually, if we aren't fast enough setting no{out,scrub,deep-scrub}, this has an avalanche effect and we end up IPMI-power-cycling half of the cluster, because all the OSD nodes are busy doing nothing (according to iostat and top, except for the load).

Is this a known bug for kernel 3.19.0-39 (Ubuntu 14.04 with the vivid kernel)? Do the XFS tweaks described here
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg25295.html
(I know that thread is about a pull request modifying the write paths) look decent or worth a try?

Currently we're running with "back to defaults" and less load (a desperate try with the filestore settings, which didn't change anything). This is the ceph.conf [osd] section, as a baffled try to get it to survive more than a day at a stretch:

[osd]
filestore max sync interval = 15
filestore min sync interval = 1
osd max backfills = 1
osd recovery op priority = 1

Maybe kernel 4.2 is worth a try?

Thx for any input
Benedikt

[1] https://www.reddit.com/r/linux/comments/18kvdb/xfsaild_is_creating_tons_of_system_threads_and/
[2] http://serverfault.com/questions/497049/the-xfs-filesystem-is-broken-in-rhel-centos-6-x-what-can-i-do-about-it
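P.S. For completeness, by no{out,scrub,deep-scrub} we just mean the standard cluster flags, set from an admin node roughly like this and unset again once things have calmed down:

ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
# and later, to go back to normal operation:
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub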