On Tue, Jul 28, 2015 at 7:20 PM, van <chaofanyu@xxxxxxxxxxx> wrote:
On Jul 28, 2015, at 7:57 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
On Tue, Jul 28, 2015 at 2:46 PM, van <chaofanyu@xxxxxxxxxxx> wrote:
Hi, Ilya,
In the dmesg output there are also a lot of libceph socket errors, which I think may be caused by my stopping the ceph service without unmapping the rbd device.
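For reference, unmapping before stopping the daemons should avoid those socket errors. A minimal sketch of how that cleanup could look (a hypothetical helper, not from this thread; it assumes the krbd sysfs layout, root privileges, and that any filesystems on the devices are already unmounted):

# unmap_all_rbd.py - hypothetical cleanup helper, run before stopping ceph
import os
import subprocess

RBD_SYSFS = "/sys/bus/rbd/devices"  # one entry per mapped krbd device

def unmap_all_rbd():
    if not os.path.isdir(RBD_SYSFS):
        return  # rbd module not loaded, or nothing is mapped
    for dev_id in os.listdir(RBD_SYSFS):
        # device ids correspond to /dev/rbd<id>
        subprocess.check_call(["rbd", "unmap", "/dev/rbd%s" % dev_id])

if __name__ == "__main__":
    unmap_all_rbd()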
Well, sure enough, if you kill all OSDs, the filesystem mounted on top of the rbd device will get stuck.
Sure, it will get stuck if the OSDs are stopped. And since RADOS requests have a retry policy, the stuck requests will recover after I start the daemons again.
But in my case, the OSDs are running normally and the librbd API can read and write without problems. Meanwhile, a heavy fio test on the filesystem mounted on top of the rbd device gets stuck.
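The kind of userspace sanity check meant here could look like this with the python-rbd bindings (a sketch only; the pool and image names are placeholders, and the image is assumed to already exist):

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")          # placeholder pool name
    try:
        image = rbd.Image(ioctx, "testimage")  # placeholder image name
        try:
            image.write(b"x" * 4096, 0)        # 4 KiB write at offset 0
            assert image.read(0, 4096) == b"x" * 4096
        finally:
            image.close()
    finally:
        ioctx.close()
finally:
    cluster.shutdown()

If this succeeds while I/O to the mounted filesystem hangs, the OSDs themselves are servicing requests and the problem is on the kernel client side.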
I wonder if this is triggered by running the rbd kernel client on machines that also run ceph daemons, i.e. the annoying loopback mount deadlock issue.
In my opinion, if it were due to the loopback mount deadlock, the OSDs would become unresponsive, no matter whether the requests came from user space (e.g. the API) or from the kernel client. Am I right?
Not necessarily. If so, my case seems to be triggered by another bug.
Anyway, it seems that I should at least separate the clients from the daemons.
Try 3.18.19 if you can. I'd be interested in your results.
It’s strange: after I dropped the page cache and restarted my OSDs, the same heavy IO tests on the rbd folder now work fine. The deadlock doesn’t seem that easy to trigger. Maybe I need longer tests.
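For completeness, dropping the page cache here means the usual Linux mechanism, e.g. (needs root):

import os

os.sync()  # flush dirty pages first so clean pages can be reclaimed
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3")  # 1 = pagecache, 2 = dentries+inodes, 3 = both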
I’ll try 3.18.19 LTS, thanks.

Thanks,
Ilya