Handling of network failures in the cluster network

Hi List,

I have a ceph cluster setup with two networks, one for public traffic
and one for cluster traffic.
Network failures in the public network are handled quite well, but
network failures in the cluster network are handled very badly.
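
For reference, the two networks are defined in ceph.conf roughly like this
(the subnets here are just placeholders, not my real ones):

    [global]
        public network  = 192.168.1.0/24   # client / public traffic
        cluster network = 192.168.2.0/24   # replication / heartbeat traffic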

I found several discussions about this topic on the mailing list; they
stated that the problem should have been fixed, but I am still hitting it.

I use Ceph v0.86 with a standard CRUSH map: 4 OSDs per host and 6 hosts
in the root default, so 24 OSDs overall.
Each storage node has two 10Gbit NICs, one for public and one for cluster
traffic. If I take down ONE of the links in the cluster network, the
cluster stops working.
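
So the layout under the default root looks like this (host names are
placeholders):

    root default
        host node1: osd.0  osd.1  osd.2  osd.3
        host node2: osd.4  osd.5  osd.6  osd.7
        ...
        host node6: osd.20 osd.21 osd.22 osd.23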

I tested this several times and observed the following different behaviors:

1. The cluster stops working and never recovers.

2. After a timeout of around 120 seconds all OTHER OSDs get marked down,
while the OSDs on the storage node with the link failure stay up. Then
all the other OSDs boot and come back, the OSDs on the node with the
failure are marked down, and the cluster starts to work again.

3. After a timeout of around 120 seconds the OSDs on the node with the
link failure get marked down and the cluster starts to work again.

A single link failure in the cluster network therefore has a very severe
impact on cluster availability.
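
From the documentation, these seem to be the options that control how
quickly failed OSDs are detected and marked down (the values are the
documented defaults; please correct me if they differ for v0.86):

    [osd]
        osd heartbeat interval      = 6    # seconds between peer heartbeats
        osd heartbeat grace         = 20   # seconds before a peer is reported down
        osd mon report interval max = 120  # matches the ~120s delay I am seeing?

    [mon]
        mon osd min down reporters  = 1    # OSDs needed to report a peer down
        mon osd min down reports    = 3    # reports needed before marking down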

Is this a configuration mistake on my side?

Any help would be greatly appreciated.


Best Regards,
 martin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



