Re: RFC: ceph daemon multi-homing

On Fri, 6 Jul 2018, Michael Lowe wrote:
> My first thought is not this use case, but could it be a way to have 
> dual-stack IPv4/IPv6?

This is already coming in nautilus.  The addrvec work that will enable 
dual stack (and msgr2) is the same thing that would enable this.  The 
first pieces of this are already in master.
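
For concreteness, a tiny standalone sketch of the addrvec idea as it applies 
to dual stack: one daemon entry carries both an IPv6 and an IPv4 address, and 
the connecting side uses whichever family it can actually reach.  The types 
and the pick() helper below are purely illustrative; they are not the actual 
addrvec or messenger code.

// Illustrative only: a daemon advertises several addresses (here one
// IPv6 and one IPv4 endpoint) and a peer picks the first one whose
// address family it supports.
#include <iostream>
#include <optional>
#include <string>
#include <vector>

enum class Family { V4, V6 };

struct Addr {
  Family family;
  std::string addr;                   // textual form, e.g. "10.0.0.5:6789"
};

using AddrVec = std::vector<Addr>;    // every address one daemon listens on

// Return the first advertised address the local host can use.
std::optional<Addr> pick(const AddrVec& av, bool have_v4, bool have_v6) {
  for (const auto& a : av) {
    if ((a.family == Family::V4 && have_v4) ||
        (a.family == Family::V6 && have_v6))
      return a;
  }
  return std::nullopt;
}

int main() {
  AddrVec mon = {{Family::V6, "[2001:db8::5]:6789"},
                 {Family::V4, "10.0.0.5:6789"}};
  if (auto a = pick(mon, /*have_v4=*/true, /*have_v6=*/false))
    std::cout << "connecting to " << a->addr << "\n";   // falls back to v4
  return 0;
}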

sage



> 
> > On Jul 6, 2018, at 6:41 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > 
> > Hi everyone,
> > 
> > Input welcome on an interesting proposal that came up with a user that 
> > prefers to use separate networks to each of their top-of-rack switches 
> > instead of bonding.  They would have, for example, 4 TOR switches, each 
> > with its own IP network: two public and two private.  The motivation is, 
> > presumably, the cornucopia of problems one encounters with bonding and 
> > various switch vendors, firmwares, and opportunities for user 
> > misconfiguration.  (I've seen my share of broken bonding setups and they 
> > are a huge headache to diagnose, and it's usually an even bigger headache 
> > to convince the network ops folks that it's their problem.)
> > 
> > My understanding is that in order for this to work each Ceph daemon would 
> > need to bind to two addresses (or four, in the case of the OSD) instead of 
> > just one.  These addresses would need to be shared throughout the system 
> > (in the OSDMap etc), and then when a connection is being made, we would 
> > round-robin connection attempts across them.  In theory the normal 
> > connection retry should make it "just work," provided we can tolerate the 
> > connection timeout/latency when we encounter a bad network.
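
To make the round-robin behaviour described above concrete, here is a minimal 
standalone sketch; try_connect is a stand-in for the messenger's real 
connection attempt, and nothing here is actual Ceph code.  In this example the 
first (broken) network simply costs one failed attempt before the second one 
succeeds.

// Illustrative only: rotate connection attempts across all advertised
// addresses, tolerating the ones that sit on a broken network.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

bool connect_round_robin(
    const std::vector<std::string>& addrs,
    const std::function<bool(const std::string&)>& try_connect,
    int max_attempts) {
  for (int i = 0; i < max_attempts; ++i) {
    const std::string& a = addrs[i % addrs.size()];  // next network in turn
    if (try_connect(a)) {
      std::cout << "connected via " << a << "\n";
      return true;
    }
    std::cout << a << " failed, trying next address\n";
  }
  return false;  // every network is down (or the peer really is gone)
}

int main() {
  // One address per public network; pretend the 10.1.0.0/24 network is down.
  std::vector<std::string> osd = {"10.1.0.7:6800", "10.2.0.7:6800"};
  connect_round_robin(
      osd,
      [](const std::string& a) { return a.rfind("10.2.", 0) == 0; },
      /*max_attempts=*/4);
  return 0;
}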
> > 
> > The new addrvec code that is going in for nautilus can (in principle) 
> > handle the multiple addresses for each daemon.  The main changes would be 
> > (1) defining a configuration model that tells daemons to bind to multiple 
> > networks (and which networks to bind to) and (2) the messenger change to 
> > round-robin across available addresses.  And (3) some hackery so that our 
> > QA can cover the relevant messenger code even though we don't have 
> > multiple networks (probably including a made-up network everywhere would 
> > do the trick... we'd round-robin across it and it would always fail).
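
For (1), one way to picture the configuration model is a multi-valued 
public-network list, where the daemon binds one address per network it 
actually has an interface on and advertises all of them.  The sketch below 
uses hard-coded prefixes in place of real config parsing and interface 
discovery, so every name and value in it is hypothetical.  The 192.0.2.0/24 
entry plays the role of the made-up network from (3): nothing local matches 
it, so it contributes no bind address here, but advertising such an address 
anyway would make connection attempts to it fail and exercise the fallback 
path.

// Illustrative only: map a list of configured public networks onto the
// local interface addresses and bind one socket per matching network.
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Local interface addresses (would come from the host in reality).
  const std::vector<std::string> local_addrs = {"10.1.0.7", "10.2.0.7"};

  // Networks the daemon is asked to bind to, as simple prefixes here.
  // The last one is a made-up network with no local interface.
  const std::vector<std::string> public_nets = {"10.1.0.", "10.2.0.",
                                                "192.0.2."};

  std::vector<std::string> bind_addrs;
  for (const auto& net : public_nets) {
    for (const auto& addr : local_addrs) {
      if (addr.rfind(net, 0) == 0)             // addr starts with net prefix
        bind_addrs.push_back(addr + ":6800");  // one listening socket per net
    }
  }
  for (const auto& a : bind_addrs)
    std::cout << "bind " << a << "\n";         // 10.1.0.7:6800, 10.2.0.7:6800
  return 0;
}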
> > 
> > Stepping back, though, I think the bigger question is: is this a good 
> > idea?  My first reaction to this was that bonding and multipath in the 
> > network is a problem for the network, and the fact that the network 
> > vendors seem to regularly screw this up isn't a very compelling reason to 
> > think that we'd do a better job than they do.  On the other hand, it seems 
> > possible to handle this case without too much additional code, and the 
> > reality seems to be that the network frequently *does* tend to screw it 
> > up.
> > 
> > Anecdotally I'm told some other storage products do this, but I have a 
> > feeling they do it in the sense that if you're using iSCSI you can just 
> > define target addresses on both networks and the normal iSCSI multipath 
> > does its thing (perfectly, I'm sure).
> > 
> > Thoughts?
> > sage
> 
> 


