Hi everyone,

Input welcome on an interesting proposal that came up with a user who prefers to use separate networks to each of their top-of-rack switches instead of bonding. They would have, for example, 4 TOR switches, each with its own IP network: two public and two private. The motivation is, presumably, the cornucopia of problems one encounters with bonding across various switch vendors and firmwares, and the opportunities for user misconfiguration. (I've seen my share of broken bonding setups; they are a huge headache to diagnose, and it's usually an even bigger headache to convince the network ops folks that it's their problem.)

My understanding is that in order for this to work, each Ceph daemon would need to bind to two addresses (or four, in the case of the OSD) instead of just one. These addresses would need to be shared throughout the system (in the OSDMap etc.), and then, when a connection is being made, we would round-robin connection attempts across them. In theory the normal connection retry should make it "just work," provided we can tolerate the connection timeout/latency when we encounter a bad network.

The new addrvec code that is going in for nautilus can (in principle) handle the multiple addresses for each daemon. The main changes would be (1) defining a configuration model that tells daemons to bind to multiple networks (and which networks to bind to), (2) the messenger change to round-robin across the available addresses, and (3) some hackery so that our QA can cover the relevant messenger code even though we don't have multiple networks (probably including a made-up network everywhere would do the trick... we'd round-robin across it and it would always fail).

Stepping back, though, I think the bigger question is: is this a good idea? My first reaction was that bonding and multipath in the network are a problem for the network, and the fact that the network vendors seem to regularly screw this up isn't a very compelling reason to think that we'd do a better job than they do. On the other hand, it seems possible to handle this case without too much additional code, and the reality seems to be that the network frequently *does* screw it up. Anecdotally I'm told some other storage products do this, but I have a feeling they do it in the sense that if you're using iSCSI you can just define target addresses on both networks and the normal iSCSI multipath does its thing (perfectly, I'm sure).

Thoughts?

sage
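
P.S. To make (1) and (2) a bit more concrete: for the config model, one possibility is letting the existing public network / cluster network options take a list of subnets, e.g. "public network = 10.1.1.0/24, 10.1.2.0/24", meaning "bind an address on each of these and publish all of them in the daemon's addrvec" (hypothetical semantics, nothing decided). For the messenger side, the round-robin-with-retry loop would look roughly like the standalone sketch below; this is illustrative only, not actual Messenger code, and all the names are made up:

    #include <functional>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Cycle through every address the peer published in its addrvec,
    // retrying until one connects or we give up.  A dead network just
    // costs us a connect timeout before we move on to the next path.
    std::optional<std::string> connect_round_robin(
        const std::vector<std::string>& addrs,                     // peer's addrvec
        const std::function<bool(const std::string&)>& try_connect,
        unsigned max_attempts)
    {
      if (addrs.empty())
        return std::nullopt;
      for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        const std::string& addr = addrs[attempt % addrs.size()];
        if (try_connect(addr))   // e.g. a TCP connect with a timeout
          return addr;           // found a working path; use it
      }
      return std::nullopt;       // every path failed (or the peer is down)
    }

    int main() {
      // Pretend the first network is dead and the second one works.
      auto fake_connect = [](const std::string& a) {
        return a.rfind("10.1.2.", 0) == 0;   // prefix check on the address
      };
      auto picked = connect_round_robin(
          {"10.1.1.5:6800", "10.1.2.5:6800"}, fake_connect, 4);
      std::cout << (picked ? *picked : std::string("no path")) << "\n";
      return 0;
    }

The QA hack in (3) is the same loop with one of the published addresses pointing at a network nobody actually has, so the try-fail-move-on path gets exercised on every connection attempt.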