We also hit a similar issue from time to time on CentOS with a 3.10.x kernel. iostat shows the kernel rbd client's device at 100% util but with no read/write I/O, and we can't umount or unmap the rbd device. After restarting the OSDs, it returns to normal.
Are your rbd kernel client and Ceph OSDs running on the same machine? Or have you encountered this problem even when the kernel client and the OSDs are on separate machines?
@Ilya, could you please point us to the possible fixes in 3.18.19 for this issue? Then we can try to backport them to our old kernel, because we can't jump to a new major kernel version. Thanks.
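For anyone who wants to watch for this state automatically, here is a rough sketch of the check that iostat makes visible: a device that is busy the whole interval and has requests in flight, yet completes no reads or writes. It assumes the standard /proc/diskstats field layout; the function names, the "rbd" name prefix, and the 0.95 threshold are illustrative choices of mine, not anything from the kernel or from Ceph.

```python
def parse_diskstats(text):
    """Parse the contents of /proc/diskstats into {device_name: counters}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        stats[fields[2]] = {
            "reads": int(fields[3]),         # reads completed
            "writes": int(fields[7]),        # writes completed
            "in_flight": int(fields[11]),    # I/Os currently in progress
            "io_ticks_ms": int(fields[12]),  # ms the device had I/O in flight
        }
    return stats


def stuck_devices(before, after, interval_ms, prefix="rbd", util_threshold=0.95):
    """Return devices that look hung between two samples.

    A device is flagged when it was busy for (almost) the whole interval and
    still has requests in flight, but completed no reads or writes.
    The prefix and threshold are hypothetical defaults for illustration.
    """
    stuck = []
    for name, b in before.items():
        a = after.get(name)
        if a is None or not name.startswith(prefix):
            continue
        completed = (a["reads"] - b["reads"]) + (a["writes"] - b["writes"])
        util = (a["io_ticks_ms"] - b["io_ticks_ms"]) / interval_ms
        if completed == 0 and a["in_flight"] > 0 and util >= util_threshold:
            stuck.append(name)
    return sorted(stuck)
```

To use it, read /proc/diskstats twice, a known number of milliseconds apart, and pass both parsed samples to stuck_devices(). It only reports the symptom; it says nothing about whether the cause is the loopback deadlock or something else.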
David Zhang
From: chaofanyu@xxxxxxxxxxx
Date: Thu, 30 Jul 2015 10:30:12 +0800
To: idryomov@xxxxxxxxx
CC: ceph-users@xxxxxxxxxxxxxx
Subject: Re: which kernel version can help avoid kernel client deadlock
On Tue, Jul 28, 2015 at 7:20 PM, van <chaofanyu@xxxxxxxxxxx> wrote:
On Jul 28, 2015, at 7:57 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Tue, Jul 28, 2015 at 2:46 PM, van <chaofanyu@xxxxxxxxxxx> wrote:
Hi, Ilya,
In dmesg, there are also a lot of libceph socket errors, which I think may be caused by my stopping the ceph service without unmapping the rbd device.
Well, sure enough, if you kill all OSDs, the filesystem mounted on top of rbd device will get stuck.
Sure, it will get stuck if the OSDs are stopped. And since RADOS requests have a retry policy, the stuck requests will recover after I start the daemons again.
But in my case, the OSDs are running normally and the librbd API can read/write normally. Meanwhile, a heavy fio test on the filesystem mounted on top of the rbd device gets stuck.
I wonder if this is triggered by running the rbd kernel client on machines that also run Ceph daemons, i.e. the annoying loopback mount deadlock issue.
In my opinion, if it’s due to the loopback mount deadlock, the OSDs will become unresponsive, no matter whether the requests come from user space (like the API) or from the kernel client. Am I right?
Not necessarily.
If so, my case seems to be triggered by another bug.
Anyway, it seems that I should separate client and daemons at least.
Try 3.18.19 if you can. I'd be interested in your results.
It’s strange: after I dropped the page cache and restarted my OSDs, the same heavy I/O tests on the rbd folder now work fine. The deadlock doesn’t seem that easy to trigger. Maybe I need longer tests.
I’ll try 3.18.19 LTS, thanks.
Thanks,
Ilya
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com