Hello Cephers,

lately our Ceph cluster has started to show some weird behavior: the OSD boxes show a load of 5000-15000 before the OSDs get marked down. Usually the box itself is still fully usable, even "apt-get dist-upgrade" runs smoothly and you can read and write to any disk; the only things you can't do are strace the osd processes, sync, or reboot. The only related log entries we find are hung_task warnings about xfsaild (the XFS AIL, i.e. Active Item List, daemon):

Dec 7 15:36:32 ceph1-store204 kernel: [152066.016108] [<ffffffff81093790>] ? kthread_create_on_node+0x1c0/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016112] INFO: task xfsaild/dm-1:1445 blocked for more than 120 seconds.
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016329]       Tainted: G         C    3.19.0-39-generic #44~14.04.1-Ubuntu
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016558] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016802] xfsaild/dm-1    D ffff8807faa03af8     0  1445      2 0x00000000
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016805]  ffff8807faa03af8 ffff8808098989d0 0000000000013e80 ffff8807faa03fd8
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016808]  0000000000013e80 ffff88080bb775c0 ffff8808098989d0 ffff88011381b2a8
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016812]  ffff8807faa03c50 7fffffffffffffff ffff8807faa03c48 ffff8808098989d0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016815] Call Trace:
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016819]  [<ffffffff817b2fd9>] schedule+0x29/0x70
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016823]  [<ffffffff817b609c>] schedule_timeout+0x20c/0x280
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016826]  [<ffffffff810a40a5>] ? sched_clock_cpu+0x85/0xc0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016830]  [<ffffffff810a0911>] ? try_to_wake_up+0x1f1/0x340
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016834]  [<ffffffff817b3d04>] wait_for_completion+0xa4/0x170
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016836]  [<ffffffff810a0ad0>] ? wake_up_state+0x20/0x20
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016840]  [<ffffffff8108e86d>] flush_work+0xed/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016846]  [<ffffffff8108acc0>] ? destroy_worker+0x90/0x90
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016870]  [<ffffffffc06f556e>] xlog_cil_force_lsn+0x7e/0x1f0 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016873]  [<ffffffff810daddb>] ? lock_timer_base.isra.36+0x2b/0x50
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016878]  [<ffffffff810dbdcf>] ? try_to_del_timer_sync+0x4f/0x70
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016901]  [<ffffffffc06f3980>] _xfs_log_force+0x60/0x270 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016904]  [<ffffffff810daba0>] ? internal_add_timer+0x80/0x80
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016926]  [<ffffffffc06f3bba>] xfs_log_force+0x2a/0x90 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016948]  [<ffffffffc06fe340>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016970]  [<ffffffffc06fe480>] xfsaild+0x140/0x5a0 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016992]  [<ffffffffc06fe340>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Dec 7 15:36:32 ceph1-store204 kernel: [152066.016996]  [<ffffffff81093862>] kthread+0xd2/0xf0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017000]  [<ffffffff81093790>] ? kthread_create_on_node+0x1c0/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017005]  [<ffffffff817b72d8>] ret_from_fork+0x58/0x90
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017009]  [<ffffffff81093790>] ? kthread_create_on_node+0x1c0/0x1c0
Dec 7 15:36:32 ceph1-store204 kernel: [152066.017013] INFO: task xfsaild/dm-6:1616 blocked for more than 120 seconds.

kswapd is also reported as hung, but we don't have swap on the OSD nodes.
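To be clear about what that load figure means: it seems to be tasks stuck in uninterruptible sleep (D state), which count towards the load average without using any CPU. A rough, untested sketch of how we'd confirm that on an affected box (plain ps/awk plus sysrq, nothing Ceph-specific):

# list and count tasks in uninterruptible sleep (D state); these are
# what push the load average into the thousands without any CPU use
ps -eo state,pid,comm | awk 'BEGIN { n = 0 } $1 ~ /^D/ { print; n++ } END { print n, "tasks in D state" }'

# dump kernel stacks of all blocked tasks to dmesg on demand (needs
# root and kernel.sysrq enabled; same info as the hung_task warnings)
echo w > /proc/sysrq-trigger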
It looks like either all the ceph-osd threads are reporting in as willing to work at once, or it's the XFS maintenance process itself, as described in [1,2].

Usually, if we aren't fast enough setting no{out,scrub,deep-scrub}, this has an avalanche effect and we end up IPMI-power-cycling half of the cluster, because all the OSD nodes are busy doing nothing (according to iostat and top, except for the load).

Is this a known bug for kernel 3.19.0-39 (Ubuntu 14.04 with the vivid kernel)? Do the XFS tweaks described here
https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg25295.html
(I know that thread is about a pull request modifying the write paths) look decent or worth a try?

Currently we're running with "back to defaults" and less load (a desperate try with the filestore settings, which didn't change anything). This is the ceph.conf [osd] section, as a baffled try to get it to survive more than a day at a stretch:

[osd]
filestore max sync interval = 15
filestore min sync interval = 1
osd max backfills = 1
osd recovery op priority = 1

Maybe kernel 4.2 is worth a try?

Thx for any input
Benedikt

[1] https://www.reddit.com/r/linux/comments/18kvdb/xfsaild_is_creating_tons_of_system_threads_and/
[2] http://serverfault.com/questions/497049/the-xfs-filesystem-is-broken-in-rhel-centos-6-x-what-can-i-do-about-it
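P.S. For completeness, by no{out,scrub,deep-scrub} we just mean the standard cluster flags, set from an admin node roughly like this and unset again once things have calmed down:

ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
# and later, to go back to normal operation:
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub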