Hi Wido,

Thanks for the reply.

On Fri, Dec 20, 2013 at 08:14:13AM +0100, Wido den Hollander wrote:
> On 12/18/2013 09:39 PM, Tim Bishop wrote:
> > I'm investigating and planning a new Ceph cluster starting with 6
> > nodes, with currently planned growth to 12 nodes over a few years.
> > Each node will probably contain 4 OSDs, maybe 6.
> >
> > The area I'm currently investigating is how to configure the
> > networking. To avoid a SPOF I'd like to have redundant switches for
> > both the public network and the internal network, most likely
> > running at 10Gb. I'm considering splitting the nodes into two
> > separate racks and connecting each half to its own switch, and then
> > trunking the switches together to allow the two halves of the
> > cluster to see each other. The idea being that if a single switch
> > fails I'd only lose half of the cluster.
>
> Why not three switches in total and use VLANs on the switches to
> separate public/cluster traffic?
>
> This way you can configure the CRUSH map to have one replica go to
> each "switch", so that when you lose a switch you still have two
> replicas available.
>
> Saves you a lot of switches and makes the network simpler.

I was planning two switches, using VLANs to separate the public and
cluster traffic, and connecting half of the cluster to each switch.
Two switches cost less than three :-) On a slightly larger cluster it
might make more sense to go up to three (or even more) switches, but
I'm not sure the extra cost is worth it at this scale.

> > (I'm not touching on the required third MON in a separate location
> > and the CRUSH rules to make sure data is correctly replicated - I'm
> > happy with the setup there)
> >
> > To allow consumers of Ceph to see the full cluster they'd be
> > directly connected to both switches. I could have another layer of
> > switches for them and interlinks between them, but I'm not sure it's
> > worth it at this sort of scale.
> >
> > My question is about configuring the public network. If it's all one
> > subnet then the clients consuming the Ceph resources can't have both
> > links active, so they'd be configured in an active/standby role. But
> > this results in quite heavy usage of the trunk between the two
> > switches when a client accesses nodes on a different switch from the
> > one it's actively connected to.
>
> Why can't the clients have both links active? You could use LACP? Some
> switches support MLAG to span LACP trunks over two switches.
>
> Or use some intelligent bonding mode in the Linux kernel.

I've only ever used LACP to a single switch, and I hadn't realised
there were options for spanning LACP links across multiple switches.
Thanks for the information there.

> > So, can I configure multiple public networks? I think so, based on
> > the documentation, but I'm not completely sure. Can I have one half
> > of the cluster on one subnet, and the other half on another? And
> > then the client machines can have interfaces in different subnets
> > and "do the right thing" with both interfaces to talk to all the
> > nodes. This seems like a fairly simple solution that avoids a SPOF
> > in Ceph or the network layer.
>
> There is no restriction on the IPs of the OSDs. All they need is a
> Layer 3 route to the whole cluster and the monitors.
>
> It doesn't have to be a single Layer 2 network; everything can simply
> be Layer 3. You just have to make sure all the nodes can reach each
> other.

Thanks, that makes sense and makes planning simpler. I suppose it's
logical really... in a huge cluster you'd probably have all manner of
networks spread around the datacenter.
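Based on that, I'm assuming something along these lines in ceph.conf
would cover both halves of the cluster - the subnets below are just
placeholders for illustration, not our real addressing:

    [global]
        # clients and MONs can reach OSDs on either of these subnets
        public network = 192.168.10.0/24, 192.168.20.0/24
        # OSD replication/heartbeat traffic on either of these subnets
        cluster network = 192.168.110.0/24, 192.168.120.0/24

My understanding from the docs is that comma-separated subnets are
allowed, and each OSD simply binds to whichever of those subnets its
host actually sits on, provided everything is routable. Please shout
if I've misread that.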
> > Or maybe I'm missing an alternative that would be better? I'm aiming
> > for something that keeps things as simple as possible while meeting
> > the redundancy requirements.
>
>         client
>           |
>           |
>      core switch
>       /   |   \
>      /    |    \
>     /     |     \
>    /      |      \
>   /       |       \
> switch1 switch2 switch3
>    |       |       |
>   OSD     OSD     OSD
>
> You could build something like that. That would be fairly simple.

Isn't the core switch in that diagram a SPOF? Or is it presumed to
already be a redundant setup?

> Keep in mind that you can always lose a switch and still keep I/O
> going.
>
> Wido

Thanks for your help. You answered my main point about IP addressing on
the public side, and gave me some other stuff to think about.

Tim.

> > As an aside, there's a similar issue on the cluster network side
> > with heavy traffic on the trunk between the two cluster switches.
> > But I can't see how that's avoidable, and presumably it's something
> > people just have to deal with in larger Ceph installations?
> >
> > Finally, this is all theoretical planning to try and avoid designing
> > in bottlenecks at the outset. I don't have any concrete idea of
> > loading, so in practice none of it may be an issue.
> >
> > Thanks for your thoughts.
> >
> > Tim.
>
> --
> Wido den Hollander
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on

--
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x6C226B37FDF38D55

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com