RE: ceph daemon multi-homing

Would this not be better achieved by configuring loopback IP addresses on the hosts, binding the daemons to those loopback addresses, and running a BGP routing agent such as quagga to handle failover through routing? That is akin to the L3 top-of-rack (TOR) model some larger installs already use. Also, wouldn't this start introducing many more cases where we have to deal with race conditions around message retransmissions, lost responses, and so on?
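
To make that concrete, here is a rough sketch of the kind of setup I mean; all addresses and AS numbers below are invented purely for illustration. Each daemon binds to a /32 configured on the loopback interface, and quagga's bgpd advertises that /32 over both uplinks, so when a link or TOR dies the route is simply withdrawn on that path:

    # ceph.conf on one host (addresses are made up)
    [osd.0]
        public addr  = 10.255.0.11     # /32 on lo, reachable via either TOR
        cluster addr = 10.255.1.11     # second loopback /32 for the cluster net

    # /etc/quagga/bgpd.conf on the same host (ASNs and uplink IPs are made up)
    router bgp 65011
      neighbor 192.168.10.1 remote-as 65000   # uplink to TOR A
      neighbor 192.168.20.1 remote-as 65000   # uplink to TOR B
      network 10.255.0.11/32
      network 10.255.1.11/32

Ceph itself stays single-homed per daemon; the routing layer decides which physical uplink the traffic actually takes.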

The other option you could look at for achieving this might be to support SCTP and let that layer handle failover. SCTP's built-in multi-homing would give you the multipoint support with what is probably a more reliable transport.
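
As a minimal sketch of the SCTP mechanism (not a proposal for the messenger itself; the addresses and port are invented, and it assumes lksctp-tools on Linux, linked with -lsctp): one listening socket is bound to an address on each public network with sctp_bindx(), and the kernel then keeps the association alive over the surviving path if one network dies.

    /*
     * Minimal sketch: one SCTP one-to-one socket bound to two local
     * addresses with sctp_bindx().  Addresses and port are invented;
     * link with -lsctp (lksctp-tools).
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int sd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
        if (sd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addrs[2];
        memset(addrs, 0, sizeof(addrs));
        addrs[0].sin_family = AF_INET;
        addrs[0].sin_port   = htons(6800);                    /* example port */
        inet_pton(AF_INET, "10.1.0.11", &addrs[0].sin_addr);  /* public net A */
        addrs[1].sin_family = AF_INET;
        addrs[1].sin_port   = htons(6800);
        inet_pton(AF_INET, "10.2.0.11", &addrs[1].sin_addr);  /* public net B */

        /* Attach both local addresses to the endpoint; the association
         * will keep using whichever path is still alive. */
        if (sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR) < 0) {
            perror("sctp_bindx");
            close(sd);
            return 1;
        }

        if (listen(sd, 16) < 0) { perror("listen"); close(sd); return 1; }
        /* accept() and read/write as with TCP from here on */
        close(sd);
        return 0;
    }

The obvious catch is that every peer, including the kernel clients, would need to speak SCTP too, which is probably a much bigger change than round-robining TCP connection attempts.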


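On (1) and (2) in Sage's mail below: just to make the discussion concrete, I would imagine the configuration model ends up looking something like the following. This is entirely hypothetical syntax on my part (nothing like it exists today), simply lists of networks instead of a single one, with each daemon binding one address per listed network and publishing all of them in its addrvec:

    # hypothetical ceph.conf for the multi-network proposal (invented syntax)
    [global]
        public network  = 10.1.0.0/24, 10.2.0.0/24
        cluster network = 10.3.0.0/24, 10.4.0.0/24

The messenger would then walk the addrvec on connect, moving to the next address whenever an attempt times out, which is where the QA concern about exercising the retry path comes in.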

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Saturday, 7 July 2018 8:42 AM
> To: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RFC: ceph daemon multi-homing
>
> Hi everyone,
>
> Input welcome on an interesting proposal that came up with a user that
> prefers to use separate networks to each of their top-of-rack switches
> instead of bonding.  They would have, for example, 4 TOR switches, each
> with their own IP network, two public and two private.  The motivation is,
> presumably, the cornucopia of problems one encounters with bonding and
> various switch vendors, firmwares, and opportunities for user
> misconfiguration.  (I've seen my share of broken bonding setups and they
> are a huge headache to diagnose, and it's usually an even bigger headache to
> convince the network ops folks that it's their problem.)
>
> My understanding is that in order for this to work each Ceph daemon would
> need to bind to two addresses (or four, in the case of the OSD) instead of just
> one.  These addresses would need to be shared throughout the system (in
> the OSDMap etc), and then when a connection is being made, we would
> round-robin connection attempts across them.  In theory the normal
> connection retry should make it "just work," provided we can tolerate the
> connection timeout/latency when we encounter a bad network.
>
> The new addrvec code that is going in for nautilus can (in principle) handle
> the multiple addresses for each daemon.  The main changes would be
> (1) defining a configuration model that tells daemons to bind to multiple
> networks (and which networks to bind to) and (2) the messenger change to
> round-robin across available addresses.  And (3) some hackery so that our QA
> can cover the relevant messenger code even though we don't have multiple
> networks (probably including a made-up network everywhere would do the
> trick... we'd round-robin across it and it would always fail).
>
> Stepping back, though, I think the bigger question is: is this a good idea?  My
> first reaction to this was that bonding and multipath in the network is a
> problem for the network, and the fact that the network vendors seem to
> regularly screw this up isn't a very compelling reason to think that we'd do a
> better job than they do.  On the other hand, it seems possible to handle this
> case without too much additional code, and the reality seems to be that the
> network frequently *does* tend to screw it up.
>
> Anecdotally I'm told some other storage products do this, but I have a feeling
> they do it in the sense that if you're using iSCSI you can just define target
> addresses on both networks and the normal iSCSI multipath does its thing
> (perfectly, I'm sure).
>
> Thoughts?
> sage