Hi Nick,
Thank you for your feedback. The cache tier was fine. We identified some packet loss between two switches. As usual with network issues, it was relatively easy to identify but not the first thing that comes to mind :)
Adrien
On Thu, Mar 17, 2016 at 2:32 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Adrien Gillard
> Sent: 17 March 2016 10:23
> To: ceph-users <ceph-users@xxxxxxxx>
> Subject: RBD hanging on some volumes of a pool
>
> Hi,
>
> I am facing issues with some of my rbd volumes since yesterday. Some of
> them completely hang at some point before eventually resuming IO, whether
> a few minutes or several hours later.
>
> First and foremost, my setup: I already detailed it on the mailing list [0][1].
> Some changes have been made: the 3 monitors are now VMs and we are
> trying kernel 4.4.5 on the clients (the cluster is still on 3.10, CentOS 7).
>
> Using EC pools, I already had some trouble with RBD features not supported
> by EC [2] and changed min_recency_* to 0 about 2 weeks ago to avoid the
> hassle. Everything has been working pretty smoothly since.
>
> All my volumes (currently 5) are on an EC pool with writeback cache. Two of
> them are perfectly fine. On the other 3, different story: doing IO is
> impossible. If I start a simple copy, I get a new file of a few dozen MB (or
> sometimes 0) then it hangs. Doing dd with direct and sync flags has the same
> behaviour.
I can only guess that your cache tier is not flushing, so writes are stalling while waiting for space to become available. Can you post the output of
ceph osd dump | grep pool
and
ceph df detail
>
> I tried switching back to 3.10, with no change. On the client I rebooted, I currently
> cannot mount the filesystem: mount hangs (the volume seems correctly
> mapped, however).
>
> strace on the cp command freezes in the middle of a read :
>
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 write(4,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 65536) = 65536
> 11:17:56 read(3,
>
>
> I tried to bump up the logging but I don't really know what to look for exactly
> and didn't see anything obvious.
>
> Any input or lead on how to debug this would be highly appreciated :)
>
> Adrien
>
> [0] http://www.spinics.net/lists/ceph-users/msg23990.html
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-January/007004.html
> [2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007746.html
>
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com