On Mon, Jul 9, 2018 at 6:35 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Fri, Jul 6, 2018 at 11:41 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>> Hi everyone,
>>
>> Input welcome on an interesting proposal that came up with a user who
>> prefers to use separate networks to each of their top-of-rack switches
>> instead of bonding. They would have, for example, 4 TOR switches, each
>> with its own IP network, two public and two private. The motivation is,
>> presumably, the cornucopia of problems one encounters with bonding and
>> various switch vendors, firmwares, and opportunities for user
>> misconfiguration. (I've seen my share of broken bonding setups and they
>> are a huge headache to diagnose, and it's usually an even bigger headache
>> to convince the network ops folks that it's their problem.)
>>
>> My understanding is that in order for this to work each Ceph daemon would
>> need to bind to two addresses (or four, in the case of the OSD) instead of
>> just one. These addresses would need to be shared throughout the system
>> (in the OSDMap etc.), and then when a connection is being made, we would
>> round-robin connection attempts across them. In theory the normal
>> connection retry should make it "just work," provided we can tolerate the
>> connection timeout/latency when we encounter a bad network.
>>
>> The new addrvec code that is going in for Nautilus can (in principle)
>> handle the multiple addresses for each daemon. The main changes would be
>> (1) defining a configuration model that tells daemons to bind to multiple
>> networks (and which networks to bind to) and (2) the messenger change to
>> round-robin across available addresses. And (3) some hackery so that our
>> QA can cover the relevant messenger code even though we don't have
>> multiple networks (probably including a made-up network everywhere would
>> do the trick... we'd round-robin across it and it would always fail).
>>
>> Stepping back, though, I think the bigger question is: is this a good
>> idea?
>
>> My first reaction to this was that bonding and multipath in the
>> network is a problem for the network.
>
> Same here!
>
>> On the other hand, it seems
>> possible to handle this case without too much additional code, and the
>> reality seems to be that the network frequently *does* tend to screw it
>> up.
>
> The QA/maintenance is my worry -- a mock network in the test config
> wouldn't really be exercising it realistically. In a year or two we'd
> probably start to be uncertain about whether anyone out there was
> really using it, as well, unless it became more widely popular.

Indeed yes. We’ve done a bad enough job handling disparate failures in
the public and private networks. Trying to deal with multiple binding
on separate public networks...ugh.

I would discourage this unless somebody has a very compelling use case
and we can come up with a viable test implementation before we merge it.
-Greg

>
> Perhaps it should depend on whether we have a developer/vendor who
> wants to do some regular pre-release testing in their own environment
> for the foreseeable future: if so, let's be accommodating. If not, it
> seems like a bridge too far.
>
> John
>
>>
>> Anecdotally I'm told some other storage products do this, but I have a
>> feeling they do it in the sense that if you're using iSCSI you can just
>> define target addresses on both networks and the normal iSCSI multipath
>> does its thing (perfectly, I'm sure).
>>
>> Thoughts?
>> sage
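
[Editorial note: for readers unfamiliar with the proposal, below is a minimal
standalone sketch of the round-robin behavior Sage describes above. It is not
Ceph code; the names (AddrVec, RoundRobinConnector, connect_once) and addresses
are invented for illustration, and the real change would live in the
messenger/addrvec layer. The point is only that a connecting peer walks the
other side's advertised addresses in turn, so a dead network costs one failed
attempt (or timeout) before falling through to the next address.]

// Hypothetical sketch, not Ceph's actual messenger API.
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using AddrVec = std::vector<std::string>;  // stand-in for a per-daemon address vector

class RoundRobinConnector {
public:
  explicit RoundRobinConnector(AddrVec addrs) : addrs_(std::move(addrs)) {}

  // Return the next address to attempt, advancing the cursor so a failed
  // attempt naturally falls through to the peer's other network(s).
  const std::string& next_addr() {
    const std::string& a = addrs_[next_ % addrs_.size()];
    ++next_;
    return a;
  }

  // Try each advertised address at most once per round; the caller's normal
  // reconnect/backoff logic would keep cycling if every attempt fails.
  template <typename TryConnectFn>
  bool connect_once(TryConnectFn&& try_connect) {
    for (std::size_t i = 0; i < addrs_.size(); ++i) {
      const std::string& addr = next_addr();
      if (try_connect(addr)) {
        return true;  // connected on this network
      }
      std::cout << "attempt to " << addr << " failed, trying next address\n";
    }
    return false;  // every advertised address failed this round
  }

private:
  AddrVec addrs_;
  std::size_t next_ = 0;  // round-robin cursor
};

int main() {
  // One daemon advertising an address on each of two public networks; pretend
  // 10.0.2.0/24 is down, which is roughly what the QA trick of advertising a
  // made-up network everywhere would look like from the messenger's side.
  RoundRobinConnector conn({"10.0.2.5:6800", "10.0.1.5:6800"});

  bool ok = conn.connect_once([](const std::string& addr) {
    return addr.rfind("10.0.1.", 0) == 0;  // only 10.0.1.0/24 is reachable
  });

  std::cout << (ok ? "connected" : "all addresses failed") << "\n";
  return 0;
}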