Hi Ilya,

Thanks for your quick reply. Here is the link: http://ceph.com/docs/cuttlefish/faq/ , under the "HOW CAN I GIVE CEPH A TRY?" section, which talks about the old-kernel requirement.

By the way, what is the main reason for recommending kernel 4.1? Are there many critical bug fixes in that version, beyond the performance improvements? I am worried that kernel 4.1 is so new that it may introduce other problems. And if I am using the librbd API, does the kernel version still matter?

In my tests, I built a two-node cluster, each node with a single OSD, running CentOS 7.1 with kernel 3.10.0-229 and ceph v0.94.2. I created several rbd images and ran mkfs.xfs on them to create filesystems (the kernel client was running on the ceph cluster nodes). I then ran heavy I/O tests on those filesystems and found that some fio processes got stuck in the D state (uninterruptible sleep) forever. I suspect it is a deadlock that makes the fio processes hang. However, the ceph-osd daemons are still responsive, and I can still operate on the rbd images via the librbd API. Does this mean it is not the loopback-mount deadlock that causes the fio processes to hang? Or is it still that deadlock, with only one thread blocked in memory allocation while other threads can still service requests, so the ceph-osd daemons remain responsive? It is worth mentioning that after I restart the ceph-osd daemon, all the processes in the D state return to normal.

Below is the related kernel log:

Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds.
Jul 7 02:25:39 node0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 02:25:39 node0 kernel: xfsaild/rbd1 D ffff880c2fc13680 0 24795 2 0x00000080
Jul 7 02:25:39 node0 kernel: ffff8801d6343d40 0000000000000046 ffff8801d6343fd8 0000000000013680
Jul 7 02:25:39 node0 kernel: ffff8801d6343fd8 0000000000013680 ffff880c0c0b0000 ffff880c0c0b0000
Jul 7 02:25:39 node0 kernel: ffff880c2fc14340 0000000000000001 0000000000000000 ffff8805bace2528
Jul 7 02:25:39 node0 kernel: Call Trace:
Jul 7 02:25:39 node0 kernel: [<ffffffff81609e39>] schedule+0x29/0x70
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a1890>] _xfs_log_force+0x230/0x290 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffff810a9620>] ? wake_up_state+0x20/0x20
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a1916>] xfs_log_force+0x26/0x80 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a64e1>] xfsaild+0x151/0x5e0 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffff8109739f>] kthread+0xcf/0xe0
Jul 7 02:25:39 node0 kernel: [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
Jul 7 02:25:39 node0 kernel: [<ffffffff8161497c>] ret_from_fork+0x7c/0xb0
Jul 7 02:25:39 node0 kernel: [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more than 120 seconds.

Does anyone else encounter the same problem, or could anyone help with this?

Thanks.
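P.S. To clarify what I mean by "operate rbd via librbd API", here is a minimal sketch (not my exact test script) using the python-rados / python-rbd bindings that ship with ceph; the conffile path and the 'rbd' pool name are just placeholders:

# Open the cluster and pool through librados, then touch each image
# through librbd -- this bypasses the in-kernel rbd client entirely.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')
    try:
        for name in rbd.RBD().list(ioctx):
            image = rbd.Image(ioctx, name)
            try:
                # A small read and a size query go through librbd/librados,
                # not through the kernel rbd driver, so they exercise a
                # different code path from the hung XFS-on-rbd mounts.
                image.read(0, 4096)
                print("%s: %d bytes" % (name, image.size()))
            finally:
                image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

This kind of listing/read still completed while the fio processes were stuck in D state, which is what I meant by the OSDs still being responsive via librbd.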