On Mon, Jul 9, 2018 at 6:35 AM, John Spray <jspray@xxxxxxxxxx> wrote:
> On Fri, Jul 6, 2018 at 11:41 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>>
>> Hi everyone,
>>
>> Input welcome on an interesting proposal that came up with a user who
>> prefers to use separate networks to each of their top-of-rack switches
>> instead of bonding. They would have, for example, 4 TOR switches, each
>> with its own IP network, two public and two private. The motivation is,
>> presumably, the cornucopia of problems one encounters with bonding and
>> various switch vendors, firmwares, and opportunities for user
>> misconfiguration. (I've seen my share of broken bonding setups and they
>> are a huge headache to diagnose, and it's usually an even bigger headache
>> to convince the network ops folks that it's their problem.)
>>
>> My understanding is that in order for this to work each Ceph daemon would
>> need to bind to two addresses (or four, in the case of the OSD) instead of
>> just one. These addresses would need to be shared throughout the system
>> (in the OSDMap etc.), and then when a connection is being made, we would
>> round-robin connection attempts across them. In theory the normal
>> connection retry should make it "just work," provided we can tolerate the
>> connection timeout/latency when we encounter a bad network.
>>
>> The new addrvec code that is going in for Nautilus can (in principle)
>> handle the multiple addresses for each daemon. The main changes would be
>> (1) defining a configuration model that tells daemons to bind to multiple
>> networks (and which networks to bind to) and (2) the messenger change to
>> round-robin across available addresses. And (3) some hackery so that our
>> QA can cover the relevant messenger code even though we don't have
>> multiple networks (probably including a made-up network everywhere would
>> do the trick... we'd round-robin across it and it would always fail).
>>
>> Stepping back, though, I think the bigger question is: is this a good
>> idea?
>
>> My first reaction to this was that bonding and multipath in the
>> network is a problem for the network.
>
> Same here!
>
>> On the other hand, it seems
>> possible to handle this case without too much additional code, and the
>> reality seems to be that the network frequently *does* tend to screw it
>> up.
>
> The QA/maintenance is my worry -- a mock network in the test config
> wouldn't really be exercising it realistically. In a year or two we'd
> probably start to be uncertain about whether anyone out there was
> really using it, as well, unless it became more widely popular.

Indeed yes. We’ve done a bad enough job handling disparate failures in
the public and private networks. Trying to deal with multiple binding
on separate public networks...ugh.

I would discourage this unless somebody has a very compelling use case
and we can come up with a viable test implementation before we merge it.
-Greg

>
> Perhaps it should depend on whether we have a developer/vendor who
> wants to do some regular pre-release testing in their own environment
> for the foreseeable future: if so, let's be accommodating. If not, it
> seems like a bridge too far.
>
> John
>
>>
>> Anecdotally I'm told some other storage products do this, but I have a
>> feeling they do it in the sense that if you're using iSCSI you can just
>> define target addresses on both networks and the normal iSCSI multipath
>> does its thing (perfectly, I'm sure).
>>
>> Thoughts?
>> sage
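
[Editorial note: for readers unfamiliar with the proposal, below is a minimal
standalone sketch of the round-robin behavior Sage describes above. It is not
Ceph code; the names (AddrVec, RoundRobinConnector, connect_once) and addresses
are invented for illustration, and the real change would live in the
messenger/addrvec layer. The point is only that a connecting peer walks the
other side's advertised addresses in turn, so a dead network costs one failed
attempt (or timeout) before falling through to the next address.]

// Hypothetical sketch, not Ceph's actual messenger API.
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using AddrVec = std::vector<std::string>;  // stand-in for a per-daemon address vector

class RoundRobinConnector {
public:
  explicit RoundRobinConnector(AddrVec addrs) : addrs_(std::move(addrs)) {}

  // Return the next address to attempt, advancing the cursor so a failed
  // attempt naturally falls through to the peer's other network(s).
  const std::string& next_addr() {
    const std::string& a = addrs_[next_ % addrs_.size()];
    ++next_;
    return a;
  }

  // Try each advertised address at most once per round; the caller's normal
  // reconnect/backoff logic would keep cycling if every attempt fails.
  template <typename TryConnectFn>
  bool connect_once(TryConnectFn&& try_connect) {
    for (std::size_t i = 0; i < addrs_.size(); ++i) {
      const std::string& addr = next_addr();
      if (try_connect(addr)) {
        return true;  // connected on this network
      }
      std::cout << "attempt to " << addr << " failed, trying next address\n";
    }
    return false;  // every advertised address failed this round
  }

private:
  AddrVec addrs_;
  std::size_t next_ = 0;  // round-robin cursor
};

int main() {
  // One daemon advertising an address on each of two public networks; pretend
  // 10.0.2.0/24 is down, which is roughly what the QA trick of advertising a
  // made-up network everywhere would look like from the messenger's side.
  RoundRobinConnector conn({"10.0.2.5:6800", "10.0.1.5:6800"});

  bool ok = conn.connect_once([](const std::string& addr) {
    return addr.rfind("10.0.1.", 0) == 0;  // only 10.0.1.0/24 is reachable
  });

  std::cout << (ok ? "connected" : "all addresses failed") << "\n";
  return 0;
}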