> The area I'm currently investigating is how to configure the
> networking. To avoid a SPOF I'd like to have redundant switches for
> both the public network and the internal network, most likely running
> at 10Gb. I'm considering splitting the nodes into two separate racks
> and connecting each half to its own switch, and then trunking the
> switches together to allow the two halves of the cluster to see each
> other. The idea being that if a single switch fails I'd only lose half
> of the cluster.

This is fine if you are using a replication factor of 2. With a
replication factor of 3 and "osd pool default min size" set to 2, you
would need 2/3 of the cluster to survive.

> My question is about configuring the public network. If it's all one
> subnet then the clients consuming the Ceph resources can't have both
> links active, so they'd be configured in an active/standby role. But
> this results in quite heavy usage of the trunk between the two
> switches when a client accesses nodes on the other switch than the one
> they're actively connected to.

The Linux bonding driver supports several strategies for teaming
network adapters on L2 networks.

> So, can I configure multiple public networks? I think so, based on the
> documentation, but I'm not completely sure. Can I have one half of the
> cluster on one subnet, and the other half on another? And then the
> client machine can have interfaces in different subnets and "do the
> right thing" with both interfaces to talk to all the nodes. This seems
> like a fairly simple solution that avoids a SPOF in Ceph or the
> network layer.

You can have multiple networks for both the public and cluster
networks; the only restriction is that all subnets of a given type be
within the same supernet.
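Concretely, a minimal ceph.conf sketch of such a layout might look like
the fragment below (the supernet addresses are illustrative; the
per-rack subnets live on the hosts' interfaces, not in ceph.conf):

```ini
[global]
    ; Supernets only -- each rack's /24 falls inside these ranges.
    public network  = 10.0.0.0/16
    cluster network = 10.1.0.0/16
```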
For example:

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2

> Or maybe I'm missing an alternative that would be better? I'm aiming
> for something that keeps things as simple as possible while meeting
> the redundancy requirements.
>
> As an aside, there's a similar issue on the cluster network side with
> heavy traffic on the trunk between the two cluster switches. But I
> can't see that's avoidable, and presumably it's something people just
> have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack, which means every write is going to cross the trunk. Other
traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/monitor traffic where an OSD is connected to a monitor in the
  adjacent rack (map updates, heartbeats)
* OSD/client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR switches) then
your cluster-level bandwidth is oversubscribed 4:1. To lower the
oversubscription you are going to have to steal some of the other 48
ports: 12 for 2:1 and 24 for a non-blocking fabric. Given the number
of nodes you have/plan to have, you will be using 6-12 links per
switch, leaving you with 12-18 links for clients on a non-blocking
fabric, 24-30 for 2:1 and 36-48 for 4:1.

--
Kyle

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
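As a footnote to the replication point at the top of the reply, a tiny
Python sketch of why size=3 with min_size=2 can't tolerate losing half
the cluster (this models min_size semantics only, not full CRUSH
placement):

```python
# With size=3 and min_size=2, a PG keeps serving I/O only while at
# least min_size of its replicas remain up.
def pg_active(replicas_up: int, min_size: int = 2) -> bool:
    """Return True if a placement group can still accept I/O."""
    return replicas_up >= min_size

# Two racks, three replicas: CRUSH must place 2 copies in one rack and
# 1 in the other.  Losing the rack that holds 2 copies leaves only 1
# replica, which is below min_size, so those PGs block I/O.
assert pg_active(3)        # healthy
assert pg_active(2)        # lost the rack holding 1 copy: still active
assert not pg_active(1)    # lost the rack holding 2 copies: blocked
```

With replication factor 2 the two copies land one per rack, so either
rack failing still leaves one replica, which is why the 2-rack split
works in that case.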