Hi, Ilya,

The dmesg also contains a lot of libceph socket errors, which I think were caused by my stopping the ceph service without unmapping the rbd devices first. Thanks for being willing to help.
On Tue, Jul 28, 2015 at 11:19 AM, van <chaofanyu@xxxxxxxxxxx> wrote:
Hi, Ilya,
Thanks for your quick reply.
Here is the link: http://ceph.com/docs/cuttlefish/faq/ . The old kernel advice is under the "HOW CAN I GIVE CEPH A TRY?" section.
By the way, what's the main reason for using kernel 4.1? Were a lot of critical bugs fixed in that version, apart from the perf improvements? I am worried that kernel 4.1 is so new that it may introduce other problems.
Well, I'm not sure what exactly is in 3.10.0.229, so I can't tell you off hand. I can think of one important memory pressure related fix that's probably not in there.

I'm suggesting the latest stable version of 4.1 (currently 4.1.3), because if you hit a deadlock (remember, this is a configuration that is neither recommended nor guaranteed to work), it'll be easier to debug and fix if the fix turns out to be worth it.

If 4.1 is not acceptable for you, try the latest stable version of 3.18 (that is 3.18.19). It's an LTS kernel, so that should mitigate some of your concerns.

And if I'm using the librbd API, does the kernel version matter?

No, not so much.

In my tests, I built a two-node cluster, each node with a single OSD, running CentOS 7.1, kernel version 3.10.0.229 and ceph v0.94.2. I created several rbds and ran mkfs.xfs on them to create filesystems. (The kernel clients were running on the ceph cluster nodes.) I performed heavy IO tests on those filesystems and found that some fio processes hung and went into D state (uninterruptible sleep) forever. I suspect it's a deadlock that makes the fio processes hang. However, the ceph-osd daemons are still responsive, and I can operate on rbds via the librbd API. Does this mean it's not the loopback mount deadlock that causes the fio processes to hang? Or is it still a deadlock, where only one thread is blocked in memory allocation while the other threads can still receive API requests, so the ceph-osd daemons remain responsive?
It is worth mentioning that after I restarted the ceph-osd daemon, all processes in D state returned to normal.
Below is the related kernel log:
Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds.
Jul 7 02:25:39 node0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 02:25:39 node0 kernel: xfsaild/rbd1 D ffff880c2fc13680 0 24795 2 0x00000080
Jul 7 02:25:39 node0 kernel: ffff8801d6343d40 0000000000000046 ffff8801d6343fd8 0000000000013680
Jul 7 02:25:39 node0 kernel: ffff8801d6343fd8 0000000000013680 ffff880c0c0b0000 ffff880c0c0b0000
Jul 7 02:25:39 node0 kernel: ffff880c2fc14340 0000000000000001 0000000000000000 ffff8805bace2528
Jul 7 02:25:39 node0 kernel: Call Trace:
Jul 7 02:25:39 node0 kernel: [<ffffffff81609e39>] schedule+0x29/0x70
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a1890>] _xfs_log_force+0x230/0x290 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffff810a9620>] ? wake_up_state+0x20/0x20
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a1916>] xfs_log_force+0x26/0x80 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a64e1>] xfsaild+0x151/0x5e0 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffffa03a6390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jul 7 02:25:39 node0 kernel: [<ffffffff8109739f>] kthread+0xcf/0xe0
Jul 7 02:25:39 node0 kernel: [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
Jul 7 02:25:39 node0 kernel: [<ffffffff8161497c>] ret_from_fork+0x7c/0xb0
Jul 7 02:25:39 node0 kernel: [<ffffffff810972d0>] ? kthread_create_on_node+0x140/0x140
Jul 7 02:25:39 node0 kernel: INFO: task xfsaild/rbd5:2914 blocked for more than 120 seconds.
Is that all there is in dmesg? Can you paste the entire dmesg?

Thanks,

Ilya
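When going through a full dmesg dump, it can help to first list which tasks the hung-task detector reported. The snippet below is only a rough sketch, assuming the header lines have the same "INFO: task NAME:PID blocked for more than N seconds" shape as in the log quoted above. The names HUNG_TASK_RE and find_hung_tasks are made up for illustration and are not part of ceph or any kernel tooling.

```python
import re

# Assumed header format, taken from the log quoted in this thread:
#   "INFO: task xfsaild/rbd1:24795 blocked for more than 120 seconds."
# The non-greedy task name lets this work even if several reports
# ended up on one physical line.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task report found."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    import sys

    # Read a saved log from stdin and print one line per report.
    for comm, pid, secs in find_hung_tasks(sys.stdin.read()):
        print(f"{comm} (pid {pid}) blocked > {secs}s")
```

For example, after saving the log with "dmesg > dmesg.txt", the script can be fed that file on stdin. It only extracts the report headers; the call traces still have to be read by hand.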