Re: RBD hanging on some volumes of a pool


 



> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Adrien Gillard
> Sent: 17 March 2016 10:23
> To: ceph-users <ceph-users@xxxxxxxx>
> Subject:  RBD hanging on some volumes of a pool
> 
> Hi,
> 
> I am facing issues with some of my rbd volumes since yesterday. Some of
> them completely hang at some point before eventually resuming IO, whether
> a few minutes or several hours later.
> 
> First, my setup: I already detailed it on the mailing list [0][1].
> Some changes have been made: the 3 monitors are now VMs and we are
> trying kernel 4.4.5 on the clients (the cluster is still on 3.10, CentOS 7).
> 
> Using EC pools, I already had some trouble with RBD features not supported
> by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the
> hassle. Everything has been working pretty smoothly since then.
> 
> All my volumes (currently 5) are on an EC pool with a writeback cache. Two of
> them are perfectly fine. On the other 3 it is a different story: doing IO is
> impossible. If I start a simple copy I get a new file of a few dozen MB (or
> sometimes 0), then it hangs. dd with the direct and sync flags behaves the
> same way.

I can only guess that you are having problems with your cache tier not flushing, so writes are stalling while waiting for space to become available. Can you post

ceph osd dump | grep pool

and 

ceph df detail
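
In particular, check the cache tier's flush/evict settings in the pool dump. If target_max_bytes and target_max_objects are both 0, the tier has no size target to flush against and can fill until writes block. As a sketch of the kind of values to set (the pool name "cache-pool" is a placeholder for your actual cache tier, and the numbers are examples, not recommendations for your cluster):

```shell
# Placeholder pool name "cache-pool"; substitute your cache tier pool.
# Give the tier an absolute size target (here ~200 GB) so the
# flush/evict ratios below have something to act against.
ceph osd pool set cache-pool target_max_bytes 200000000000

# Start flushing dirty objects down to the EC base tier at 40% of the target...
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4

# ...and start evicting clean objects at 80%, before the tier fills up.
ceph osd pool set cache-pool cache_target_full_ratio 0.8
```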

> 
> I tried switching back to 3.10, no change. On the client I rebooted, I currently
> cannot mount the filesystem; mount hangs (the volume seems correctly
> mapped, however).
> 
> strace on the cp command freezes in the middle of a read:
> 
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> 
> 
> I tried bumping up the logging, but I don't know exactly what to look for
> and didn't see anything obvious.
> 
> Any input or lead on how to debug this would be highly appreciated :)
> 
> Adrien
> 
> [0] http://www.spinics.net/lists/ceph-users/msg23990.html
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-January/007004.html
> [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007746.html
> 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


