Re: Ceph network topology with redundant switches

On Fri, Dec 20, 2013 at 09:26:35AM -0800, Kyle Bader wrote:
> > The area I'm currently investigating is how to configure the
> > networking. To avoid a SPOF I'd like to have redundant switches for
> > both the public network and the internal network, most likely running
> > at 10Gb. I'm considering splitting the nodes in to two separate racks
> > and connecting each half to its own switch, and then trunk the
> > switches together to allow the two halves of the cluster to see each
> > other. The idea being that if a single switch fails I'd only lose half
> > of the cluster.
> 
> This is fine if you are using a replication factor of 2, you would
> need 2/3 of the cluster surviving if using a replication factor 3 with
> "osd pool default min size" set to 2.

Ah! Thanks for pointing that out. I'd not appreciated the impact of that
setting. I'll have to consider my options here. This also fits well with
Wido's reply (in this same thread) about splitting the nodes into three
groups rather than two, although the cost of 10Gb networking starts to
become prohibitive at that point.
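
To make sure I've understood, I think these are the settings in play (just
a sketch of the ceph.conf lines; the actual values are what I now need to
decide on):

    [global]
        osd pool default size = 3       # three copies of each object
        osd pool default min size = 2   # I/O to a PG pauses once fewer than 2 copies are active

In other words, with size 3 and min size 2 I could only afford to lose one
of the three groups, which I think is what you're saying.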

> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on the other switch than the one
> > they're actively connected to.
> 
> The linux bonding driver supports several strategies for teaming network
> adapters on L2 networks.

Across switches? Wido's reply mentions using MLAG to span LACP trunks
across switches. This isn't something I'd seen before, so I'd assumed I
couldn't do it. Certainly an area I need to look into more.
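
For my own reference, I think the client side would end up looking
something like this if the switches can do MLAG (a sketch of a
Debian-style /etc/network/interfaces stanza; the address and interface
names are just placeholders):

    auto bond0
    iface bond0 inet static
        address 10.0.1.21          # placeholder client address on the public network
        netmask 255.255.255.0
        bond-slaves eth0 eth1      # one leg to each switch
        bond-mode 802.3ad          # LACP; relies on MLAG/stacking across the two switches
        bond-miimon 100
        bond-lacp-rate fast

Without MLAG I guess I'd be back to bond-mode active-backup, which is the
active/standby arrangement I described originally.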

> > So, can I configure multiple public networks? I think so, based on the
> > documentation, but I'm not completely sure. Can I have one half of the
> > cluster on one subnet, and the other half on another? And then the
> > client machine can have interfaces in different subnets and "do the
> > right thing" with both interfaces to talk to all the nodes. This seems
> > like a fairly simple solution that avoids a SPOF in Ceph or the network
> > layer.
> 
> You can have multiple networks for both the public and cluster networks,
> the only restriction is that all subnets for a given type be within the same
> supernet. For example
> 
> 10.0.0.0/16 - Public supernet (configured in ceph.conf)
> 10.0.1.0/24 - Public rack 1
> 10.0.2.0/24 - Public rack 2
> 10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
> 10.1.1.0/24 - Cluster rack 1
> 10.1.2.0/24 - Cluster rack 2

Thanks, that clarifies how this works.
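
Just to check my understanding, I think that means the ceph.conf side is
only the two supernets (a sketch based on your addressing above):

    [global]
        public network  = 10.0.0.0/16    # covers the per-rack public subnets
        cluster network = 10.1.0.0/16    # covers the per-rack cluster subnets

and, if I've understood correctly, each daemon then binds to whichever
local address falls inside those ranges.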

> > As an aside, there's a similar issue on the cluster network side with
> > heavy traffic on the trunk between the two cluster switches. But I
> > can't see that's avoidable, and presumably it's something people just
> > have to deal with in larger Ceph installations?
> 
> A proper CRUSH configuration is going to place a replica on a node in
> each rack, this means every write is going to cross the trunk. Other
> traffic that you will see on the trunk:
> 
> * OSDs gossiping with one another
> * OSD/Monitor traffic in the case where an OSD is connected to a
>   monitor connected in the adjacent rack (map updates, heartbeats).

Am I right in saying that the first of these happens over the cluster
network and the second over the public network? It looks like monitors
don't have a cluster network address.
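
On the CRUSH point, I assume the rack-aware placement you describe is
expressed with a rule along these lines (a sketch, assuming I add rack
buckets to the CRUSH map and hang the hosts off them; the ruleset number
is arbitrary):

    rule replicated_racks {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack   # pick each replica from a different rack
        step emit
    }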

> * OSD/Client traffic where the OSD and client are in adjacent racks
> 
> If you use all 4 40GbE uplinks (common on 10GbE ToR) then your
> cluster level bandwidth is oversubscribed 4:1. To lower oversubscription
> you are going to have to steal some of the other 48 ports, 12 for 2:1 and
> 24 for a non-blocking fabric. Given number of nodes you have/plan to
> have you will be utilizing 6-12 links per switch, leaving you with 12-18
> links for clients on a non-blocking fabric, 24-30 for 2:1 and 36-48 for 4:1.

I guess this is an area where I can instrument the load on the trunk and
add more links if required.

Thank you for your help.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



