Hello,

On Mon, 29 Aug 2016 16:16:11 -0700 Eric Kolb wrote:

> Hello,
>
> Have read a few items about what occurs if the back-end cluster switch
> were to fail or be rebooted due to code updates. From the
> Troubleshooting OSDs guide
> (http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-osd/)
> it states, "if the cluster (back-end) network fails or develops
> significant latency while the public (front-end) network operates
> optimally, OSDs currently do not handle this situation well".
>
That's putting it very nicely.

> May someone have any experience with this scenario they may be able to
> pass along?
>
No personal experience, as I strive to avoid that scenario at all costs.

If the "few items" you read contained mails by me, then the following will
sound familiar:

1. Why split the network in the first place?
From a bandwidth perspective, it only makes sense if your OSDs can write
faster than the combined bandwidth.
If you're thinking about segregating the networks for policy reasons,
still use a unified network but with VLANs.

2. Avoid failures.
Since you're already looking at (at least) 2 network interfaces, avoid a
node loss due to interface or switch failures entirely.
Do this either with Active-Standby failover (less bandwidth, but the
cheapest switches will do) or with the more advantageous LACP with MC-LAG
switches (full bandwidth if both switches are up, still one link of BW if
one goes down).

The latter service level can also be achieved by routing (OSPF/BGP) on the
hosts, something that was discussed on this list as well.
It's more involved, but can use cheap switches as well.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
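
A rough sketch of the bandwidth comparison in point 1 of the mail, in Python. It assumes that "combined bandwidth" means the aggregate throughput of a node's network links, which is only one reading of the sentence. The function name, the parameters and the example figures (OSD counts, per-OSD write rates, link speeds) are hypothetical and not taken from the thread.

```python
def split_network_worthwhile(osd_count, osd_write_mb_s, link_gbit_s, link_count=2):
    """Return True if the node's OSDs could write faster than its links carry.

    Assumption: "combined bandwidth" is the sum of all link speeds on the node.
    """
    # Aggregate sequential write rate of all OSDs in the node, in MB/s.
    osd_total_mb_s = osd_count * osd_write_mb_s
    # 1 Gbit/s = 125 MB/s in decimal units; protocol overhead is ignored.
    combined_mb_s = link_count * link_gbit_s * 125
    return osd_total_mb_s > combined_mb_s


if __name__ == "__main__":
    # Hypothetical node: 8 HDD OSDs at about 100 MB/s each, 2 x 10 Gbit/s links.
    print(split_network_worthwhile(8, 100, 10))
    # Hypothetical node: 12 SSD OSDs at about 400 MB/s each, same links.
    print(split_network_worthwhile(12, 400, 10))
```

With these made-up numbers the first node's disks (800 MB/s) stay well below the links (2500 MB/s), so under this reading a split buys nothing, while the second node (4800 MB/s) could outrun them.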