> The area I'm currently investigating is how to configure the
> networking. To avoid a SPOF I'd like to have redundant switches for
> both the public network and the internal network, most likely running
> at 10Gb. I'm considering splitting the nodes into two separate racks
> and connecting each half to its own switch, and then trunking the
> switches together to allow the two halves of the cluster to see each
> other. The idea being that if a single switch fails I'd only lose half
> of the cluster.

This is fine if you are using a replication factor of 2. With a
replication factor of 3 and "osd pool default min size" set to 2, you
would need 2/3 of the cluster to survive.

> My question is about configuring the public network. If it's all one
> subnet then the clients consuming the Ceph resources can't have both
> links active, so they'd be configured in an active/standby role. But
> this results in quite heavy usage of the trunk between the two
> switches when a client accesses nodes on the other switch than the one
> they're actively connected to.

The Linux bonding driver supports several strategies for teaming
network adapters on L2 networks.

> So, can I configure multiple public networks? I think so, based on the
> documentation, but I'm not completely sure. Can I have one half of the
> cluster on one subnet, and the other half on another? And then the
> client machine can have interfaces in different subnets and "do the
> right thing" with both interfaces to talk to all the nodes. This seems
> like a fairly simple solution that avoids a SPOF in Ceph or the
> network layer.

You can have multiple networks for both the public and cluster
networks; the only restriction is that all subnets of a given type be
within the same supernet.
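Concretely, a minimal ceph.conf sketch of such a layout might look like
the fragment below (the supernet addresses are illustrative; the
per-rack subnets live on the hosts' interfaces, not in ceph.conf):

```ini
[global]
    ; Supernets only -- each rack's /24 falls inside these ranges.
    public network  = 10.0.0.0/16
    cluster network = 10.1.0.0/16
```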
For example:

10.0.0.0/16 - Public supernet (configured in ceph.conf)
10.0.1.0/24 - Public rack 1
10.0.2.0/24 - Public rack 2
10.1.0.0/16 - Cluster supernet (configured in ceph.conf)
10.1.1.0/24 - Cluster rack 1
10.1.2.0/24 - Cluster rack 2

> Or maybe I'm missing an alternative that would be better? I'm aiming
> for something that keeps things as simple as possible while meeting
> the redundancy requirements.
>
> As an aside, there's a similar issue on the cluster network side with
> heavy traffic on the trunk between the two cluster switches. But I
> can't see that's avoidable, and presumably it's something people just
> have to deal with in larger Ceph installations?

A proper CRUSH configuration is going to place a replica on a node in
each rack, which means every write is going to cross the trunk. Other
traffic that you will see on the trunk:

* OSDs gossiping with one another
* OSD/monitor traffic where an OSD is connected to a monitor in the
  adjacent rack (map updates, heartbeats)
* OSD/client traffic where the OSD and client are in adjacent racks

If you use all 4 40GbE uplinks (common on 10GbE ToR switches) then
your cluster-level bandwidth is oversubscribed 4:1. To lower the
oversubscription you are going to have to steal some of the other 48
ports: 12 for 2:1 and 24 for a non-blocking fabric. Given the number
of nodes you have/plan to have, you will be using 6-12 links per
switch, leaving you with 12-18 links for clients on a non-blocking
fabric, 24-30 for 2:1 and 36-48 for 4:1.

--
Kyle

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
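As a footnote to the replication point at the top of the reply, a tiny
Python sketch of why size=3 with min_size=2 can't tolerate losing half
the cluster (this models min_size semantics only, not full CRUSH
placement):

```python
# With size=3 and min_size=2, a PG keeps serving I/O only while at
# least min_size of its replicas remain up.
def pg_active(replicas_up: int, min_size: int = 2) -> bool:
    """Return True if a placement group can still accept I/O."""
    return replicas_up >= min_size

# Two racks, three replicas: CRUSH must place 2 copies in one rack and
# 1 in the other.  Losing the rack that holds 2 copies leaves only 1
# replica, which is below min_size, so those PGs block I/O.
assert pg_active(3)        # healthy
assert pg_active(2)        # lost the rack holding 1 copy: still active
assert not pg_active(1)    # lost the rack holding 2 copies: blocked
```

With replication factor 2 the two copies land one per rack, so either
rack failing still leaves one replica, which is why the 2-rack split
works in that case.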