Beat Rubischon wrote:
Hello!
Quoting <gordan@xxxxxxxxxx> (13.10.10 09:11):
I would also expect to see network issues as a cluster grows.
Performance reducing as the node count increases isn't seen as a bigger
issue?
I have pretty bad experience with multicast. Running several clusters in
the range of 500-1000 nodes in a single broadcast domain over the last year
showed that broadcast or multicast can kill your fabrics easily.
What sort of a cluster are you running with that many nodes? RHCS?
Heartbeat? Something else entirely? In what arrangement?
Even the most expensive GigE switch chassis could be killed by 125+ MBytes/s
of traffic, which is almost nothing :-)
Sounds like a typical example of cost not being a good measure of
quality and performance. :)
The inbound traffic must be routed to
several outbound ports, which results in congestion. Throttling and packet
loss are the result, even for the normal unicast traffic. Multicast or
broadcast is a nice way to cause a denial of service in a LAN.
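
For illustration, here is a rough back-of-envelope of that fan-out effect
(the rates and port counts below are assumptions, not measurements): a
multicast stream entering one port is copied to every subscribed egress
port, so the total egress load grows with the subscriber count even though
the ingress rate never changes.

# Illustrative sketch: multicast/broadcast fan-out on a switch fabric.
INGRESS_GBPS = 1.0   # assumed multicast ingress rate (~125 MByte/s)

def total_egress_gbps(ingress_gbps, subscribed_ports):
    # One inbound copy becomes one outbound copy per subscribed port.
    return ingress_gbps * subscribed_ports

for ports in (24, 48, 500, 1000):
    print(f"{ports:5d} subscribed ports -> "
          f"{total_egress_gbps(INGRESS_GBPS, ports):7.1f} Gbit/s of egress "
          f"for {INGRESS_GBPS} Gbit/s of ingress")
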
If you have that many GlusterFS nodes, you have DoS-ed the storage
network anyway, purely through the write amplification that comes with the
inverse scaling of replication.
The idea is that you have a (mostly) dedicated VLAN for this.
In InfiniBand, multicast is typically realized by looping over the
destinations directly on the source HCA. One of the main targets of
current development is multicast, since it matches the network pattern of
the MPI collectives. So there is no win in using multicast over such a
fabric.
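
If I'm reading that right, the pattern is roughly the sketch below (purely
illustrative; send_unicast() is a hypothetical stand-in, not a real verbs
call): a "multicast" send degenerates into a loop of unicast sends, so the
source link carries the payload once per destination.

def send_unicast(destination, payload):
    # Hypothetical stand-in for the real transport; just counts the bytes
    # that would leave the source link.
    return len(payload)

def software_multicast(destinations, payload):
    # "Multicast" realized as one unicast send per destination.
    return sum(send_unicast(dest, payload) for dest in destinations)

payload = b"x" * 1_000_000                        # a 1 MB write, for illustration
destinations = [f"node{i:03d}" for i in range(16)]

total = software_multicast(destinations, payload)
print(f"{len(destinations)} destinations: {total / 1e6:.0f} MB leave the "
      f"source link to deliver a {len(payload) / 1e6:.0f} MB payload")
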
Sure, but historically in the networking space, non-Ethernet
technologies have always been niche, cost-ineffective in terms of
price/performance, and have only had a temporary performance advantage.
At the moment the only way to achieve linear-ish scaling is to connect
the nodes directly to each other, but that means that each node in an
n-way cluster has to have at least n NICs (one to each other node plus
at least one to the clients). That becomes impractical very quickly.
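
To put rough numbers on that (assuming one dedicated NIC per peer plus a
single client-facing NIC per node, which is just one possible layout):

def nics_per_node(n_nodes, client_nics=1):
    # One dedicated NIC to each of the other n-1 nodes, plus client-facing NICs.
    return (n_nodes - 1) + client_nics

def total_storage_links(n_nodes):
    # Every pair of nodes needs its own link: n choose 2.
    return n_nodes * (n_nodes - 1) // 2

for n in (3, 5, 10, 20):
    print(f"{n:3d}-way cluster: {nics_per_node(n):3d} NICs per node, "
          f"{total_storage_links(n):4d} point-to-point links in total")
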
Using a parallel storage system means you have communication from a large
number of nodes to a large number of servers. Using unicast looks bad at
first glance, but I'm confident it's the better solution overall.
The problem is that the write bandwidth is fundamentally limited by the
speed of your interconnect, and this is shared between all the nodes. So
if you are on 1Gb ethernet, your total volume of writes across all
the replicas cannot exceed 1Gb/s. If you have 10 nodes, that limits your
write throughput to 100Mb/s (a rough calculation follows after the list
below). The only way to scale is either by:
1) Putting ever more NICs into each storage chassis and turning the
replication network into what is essentially a point-to-point network.
Complex, inefficient and adding to the cost - 1Gb NICs are cheap, but
10Gb ones are still expensive. There is also a limit on how many NICs
you can cram into a COTS storage node.
2) Upgrading to an ever faster interconnect (10Gb ethernet, Infiniband,
etc.). Expensive and still doesn't scale as you add more storage nodes.
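
To put rough numbers on the write ceiling described above (assuming every
write is replicated to every storage node over one shared link; real
deployments will differ):

LINK_MBPS = 1000    # a 1Gb ethernet replication network

def max_write_throughput_mbps(link_mbps, replica_count):
    # The shared link carries replica_count copies of every write, so the
    # useful write rate is the link rate divided by the replica count.
    return link_mbps / replica_count

for nodes in (2, 4, 10, 20):
    print(f"{nodes:3d} replicas over {LINK_MBPS} Mbit/s -> "
          f"{max_write_throughput_mbps(LINK_MBPS, nodes):6.1f} Mbit/s of useful writes")
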
Right now more storage nodes means slower storage, and that should
really be addressed.
Gordan