Hi everyone,

Input welcome on an interesting proposal that came up with a user who prefers to use separate networks to each of their top-of-rack switches instead of bonding. They would have, for example, 4 TOR switches, each with its own IP network: two public and two private. The motivation is, presumably, the cornucopia of problems one encounters with bonding across various switch vendors and firmwares, and the opportunities for user misconfiguration. (I've seen my share of broken bonding setups; they are a huge headache to diagnose, and it's usually an even bigger headache to convince the network ops folks that it's their problem.)

My understanding is that in order for this to work, each Ceph daemon would need to bind to two addresses (or four, in the case of the OSD) instead of just one. These addresses would need to be shared throughout the system (in the OSDMap etc.), and then, when a connection is being made, we would round-robin connection attempts across them. In theory the normal connection retry should make it "just work," provided we can tolerate the connection timeout/latency when we encounter a bad network.

The new addrvec code that is going in for nautilus can (in principle) handle the multiple addresses for each daemon. The main changes would be (1) defining a configuration model that tells daemons to bind to multiple networks (and which networks to bind to), (2) the messenger change to round-robin across the available addresses, and (3) some hackery so that our QA can cover the relevant messenger code even though we don't have multiple networks (probably including a made-up network everywhere would do the trick... we'd round-robin across it and it would always fail).

Stepping back, though, I think the bigger question is: is this a good idea? My first reaction was that bonding and multipath in the network are a problem for the network, and the fact that the network vendors seem to regularly screw this up isn't a very compelling reason to think that we'd do a better job than they do. On the other hand, it seems possible to handle this case without too much additional code, and the reality seems to be that the network frequently *does* screw it up. Anecdotally I'm told some other storage products do this, but I have a feeling they do it in the sense that if you're using iSCSI you can just define target addresses on both networks and the normal iSCSI multipath does its thing (perfectly, I'm sure).

Thoughts?

sage
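
P.S. To make (1) and (2) a bit more concrete: for the config model, one possibility is letting the existing public network / cluster network options take a list of subnets, e.g. "public network = 10.1.1.0/24, 10.1.2.0/24", meaning "bind an address on each of these and publish all of them in the daemon's addrvec" (hypothetical semantics, nothing decided). For the messenger side, the round-robin-with-retry loop would look roughly like the standalone sketch below; this is illustrative only, not actual Messenger code, and all the names are made up:

    #include <functional>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Cycle through every address the peer published in its addrvec,
    // retrying until one connects or we give up.  A dead network just
    // costs us a connect timeout before we move on to the next path.
    std::optional<std::string> connect_round_robin(
        const std::vector<std::string>& addrs,                     // peer's addrvec
        const std::function<bool(const std::string&)>& try_connect,
        unsigned max_attempts)
    {
      if (addrs.empty())
        return std::nullopt;
      for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        const std::string& addr = addrs[attempt % addrs.size()];
        if (try_connect(addr))   // e.g. a TCP connect with a timeout
          return addr;           // found a working path; use it
      }
      return std::nullopt;       // every path failed (or the peer is down)
    }

    int main() {
      // Pretend the first network is dead and the second one works.
      auto fake_connect = [](const std::string& a) {
        return a.rfind("10.1.2.", 0) == 0;   // prefix check on the address
      };
      auto picked = connect_round_robin(
          {"10.1.1.5:6800", "10.1.2.5:6800"}, fake_connect, 4);
      std::cout << (picked ? *picked : std::string("no path")) << "\n";
      return 0;
    }

The QA hack in (3) is the same loop with one of the published addresses pointing at a network nobody actually has, so the try-fail-move-on path gets exercised on every connection attempt.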