Hi List,

I have a Ceph cluster set up with two networks: one for public traffic and one for cluster traffic. Network failures in the public network are handled quite well, but network failures in the cluster network are handled very badly. I found several discussions on the mailing list about this topic stating that the problem should be fixed, but I still see it.

I use Ceph v0.86 with a standard crushmap, 4 OSDs per host, and 6 hosts in the root "default", so 24 OSDs overall. Each storage node has two 10Gbit NICs, one for public and one for cluster traffic. If I take down ONE of the links in the cluster network, the cluster stops working. I tested this several times and observed the following different behaviors:

1. The cluster stops forever.
2. After a timeout of around 120 seconds all other OSDs get marked down, while the OSDs on the storage node with the link failure stay up. Then all other OSDs boot and come back, the OSDs on the node with the failure are marked down, and the cluster starts to work again.
3. After a timeout of around 120 seconds the OSDs on the node with the link failure get marked down and the cluster starts to work again.

A link failure in the cluster network therefore has a very severe impact on cluster availability. Is this a configuration mistake on my side? Any help would be greatly appreciated.

Best Regards,
martin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
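
For reference, a minimal sketch of the network-related parts of a ceph.conf for a two-network setup like this (the subnets are illustrative placeholders, not my actual addresses, and the tuning values shown are roughly the upstream defaults in this era, not recommendations):

```ini
[global]
# Clients talk to OSDs on the public network; OSD replication,
# recovery, and OSD-to-OSD heartbeats use the cluster network.
# Subnets below are placeholders for illustration only.
public network  = 10.0.0.0/24
cluster network = 10.0.1.0/24

# Failure-detection settings that influence how quickly a dead
# cluster-network link is noticed and acted on by the monitors:
osd heartbeat grace        = 20   # seconds before a missed heartbeat counts
mon osd min down reporters = 1    # peers needed to report an OSD as down
```
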