Hi Alex,

I think the answer is that you do one of two things. You either design your network so that it is fault tolerant in every way, so that a network interruption is not possible, or you go with non-redundant networking but design your CRUSH map around the failure domains of the network (a rough sketch of the latter is below the quote).

I'm interested in your example where OSDs were unable to communicate. What happened? Would it be possible to redesign the network to stop this from happening?

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Alex Gorbachev
> Sent: 27 June 2015 19:02
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Redundant networks in Ceph
>
> The current network design in Ceph
> (http://ceph.com/docs/master/rados/configuration/network-config-ref)
> uses non-redundant networks for both cluster and public communication.
> Ideally, in a high-load environment these will be 10 or 40+ GbE networks.
> For cost reasons, most such installations will use the same switch
> hardware and separate Ceph traffic using VLANs.
>
> Networking is complex, and situations are possible where switches and
> routers drop traffic. We ran into one of those at one of our sites:
> connections to the hosts stayed up (so bonding NICs does not help), yet
> OSD communication was disrupted, client IO hung, and failures cascaded
> to client applications.
>
> My understanding is that if OSDs cannot connect over the cluster network
> for some time, IO will hang and time out. The document states:
>
> "If you specify more than one IP address and subnet mask for either the
> public or the cluster network, the subnets within the network must be
> capable of routing to each other."
>
> In the real world this means a complicated Layer 3 routing setup, which
> is not practical in many configurations.
>
> What if there were an option for "cluster 2" and "public 2" networks, to
> which OSDs and MONs would go in either active/backup or active/active
> mode (cluster 1 and cluster 2 existing separately and not routing to
> each other)?
>
> The difference between this setup and bonding is that here the decision
> to fail over and try the other network is made at the OSD/MON level, and
> it brings resilience to faults within the switch core, which are really
> only detectable at the application layer.
>
> Am I missing an already existing feature? Please advise.
>
> Best regards,
> Alex Gorbachev
> Intelligent Systems Services Inc.
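
To make the CRUSH suggestion concrete, here is a rough sketch, not a drop-in map: assume one "rack" bucket per top-of-rack switch (the bucket names, hosts and weights below are made up for illustration), with those buckets placed under your root. A rule like this keeps each replica under a different switch, so a single switch or VLAN failure cannot take out every copy of a placement group:

    # Standard workflow: decompile, edit, recompile, inject:
    #   ceph osd getcrushmap -o crush.bin
    #   crushtool -d crush.bin -o crush.txt
    #   (edit crush.txt)
    #   crushtool -c crush.txt -o crush.new
    #   ceph osd setcrushmap -i crush.new

    # One "rack" bucket per switch (hypothetical hosts and weights):
    rack switch-a {
            id -10
            alg straw
            hash 0
            item node1 weight 2.000
            item node2 weight 2.000
    }
    rack switch-b {
            id -11
            alg straw
            hash 0
            item node3 weight 2.000
            item node4 weight 2.000
    }

    # Replicated rule: one replica per rack, i.e. per switch.
    rule replicated_per_switch {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }

Remember the switch-a/switch-b buckets also need to be added as items under the default root for the rule to find them, and the pool needs pointing at the rule, e.g. "ceph osd pool set <pool> crush_ruleset 1".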
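
For reference, the comma-delimited subnet support quoted from the docs is plain ceph.conf; something like the below (addresses are made-up examples). The commented-out last line is the multi-subnet form that triggers the routing requirement:

    [global]
    # Client-facing traffic (MONs, clients)
    public network = 192.168.10.0/24
    # OSD replication and heartbeat traffic
    cluster network = 192.168.20.0/24

    # Multiple comma-delimited subnets are allowed, but per the docs
    # they must be able to route to each other -- exactly the Layer 3
    # complication described above:
    # cluster network = 192.168.20.0/24, 192.168.21.0/24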
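
And if I've understood the proposal correctly, what's being asked for would look something like this. To be clear, this is purely hypothetical syntax, no such options exist in Ceph today; it just illustrates the idea of the daemons themselves failing over between two non-routed networks:

    [global]
    # Existing options:
    public network  = 192.168.10.0/24
    cluster network = 192.168.20.0/24
    # Hypothetical second networks (NOT real Ceph options), which
    # OSDs/MONs would use in active/backup or active/active fashion:
    # public network 2  = 192.168.11.0/24
    # cluster network 2 = 192.168.21.0/24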