Re: Replicate/AFR Using Broadcast/Multicast?

Gordan Bobic <gordan@xxxxxxxxxx> · Wed, 13 Oct 2010 22:30:21 +0100

On 10/13/2010 01:22 PM, Beat Rubischon wrote:
Hi Gordan!

Quoting <gordan@xxxxxxxxxx> (13.10.10 10:06):

What sort of a cluster are you running with that many nodes? RHCS?
Heartbeat? Something else entirely? In what arrangement?

High performance clusters. The main target Gluster was made for :-)

I'm curious about your use case. I'm guessing it is mostly dependent on
throughput and not particularly sensitive to I/O latency.

Even the most expensive GigE switch chassis could be killed by 125+ MBytes
of traffic which is almost nothing :-)
Sounds like a typical example of cost not being a good measure of
quality and performance. :)

It's simply a technical limit. Think about what broadcast is and how it
passes a switch.

I'm fully aware of that, but if your switching fabric can't handle the 
full rated bandwidth of the switch, that's pretty poor. Then again, I 
expect specmanship* everywhere these days and don't believe any figures 
until I've tested them myself.

In Infiniband...
Sure, but historically in the networking space, non-ethernet
technologies have always been niche, cost ineffective in terms of
price/performance and only had a temporary performance advantage.

Right.  You'll be surprised but the price per port is much lower in the
Infiniband world compared to the 10GigE world. When using GlusterFS inside a
datacenter Infiniband could be a good choice.

Maybe this year. Unlikely to be the case next year.

Right now more storage nodes means slower storage, and that should
really be addressed.

Wrong. Assuming you have a "distribute" concept. 10 clients talk to 5
servers. Storing a file means the client writes the file to one of the
servers. Reading works the same way. So the bandwidth of each server is
accumulated. With GigE this means you'll have about 600MBytes/s network
bandwidth. Additional servers will add additional bandwidth - as long as you
scale not only servers but also clients. One small exception: The lookup of
a file must be directed to all servers. One of the reasons why GlusterFS is
"better" for a small number of large files than for a large number of
small files.
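
A rough sketch of the arithmetic behind that 600MBytes/s figure. The
120 MBytes/s usable rate per GigE link is an assumption (it is simply 600
divided by 5), and so is the idea that every node has exactly one link and
that files spread evenly over the servers. Lookup traffic is ignored.

```python
# Assumed usable payload rate of a single GigE link, in MBytes/s.
GIGE_MBYTES_PER_S = 120


def distribute_aggregate_bandwidth(servers, clients,
                                   link_mbytes_per_s=GIGE_MBYTES_PER_S):
    """Upper bound on aggregate throughput of a pure "distribute" setup.

    Each file lives on one server, so traffic is limited by whichever
    side (servers or clients) has fewer links in total.
    """
    return min(servers, clients) * link_mbytes_per_s


# 10 clients, 5 servers, as in the example above.
print(distribute_aggregate_bandwidth(servers=5, clients=10))  # 600
```

Doubling the servers to 10 doubles the bound, but adding servers beyond the
number of clients changes nothing in this model, which matches the "scale
not only servers but also clients" caveat.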

Multiple lookups cause latency, and latency is already a serious issue
on Gluster. I'm talking about the straight replicate case, where
throughput is inversely proportional to the number of replicas.

Right when you use a "replicate" concept. Your client has to write to both
members of the replica.

I usually run with server-side replication specifically for that reason 
- I can have a dedicated VLAN for storage servers with as much network
bandwidth as I can throw at it. Then I can have the servers sort out the
replication overheads between them, rather than needing a multiple of 
bandwidth to the clients as well.
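
To put rough numbers on that, here is a minimal sketch comparing the two
arrangements. It is a bandwidth-only model and an assumption on my part: it
ignores latency and protocol overhead, and it assumes the server that
receives the write forwards the remaining copies over the storage network.
The function names and the link speeds in the loop are made up for
illustration.

```python
def client_side_write_rate(client_link, replicas):
    """Client sends every replica itself, so its link is split N ways."""
    return client_link / replicas


def server_side_write_rate(client_link, storage_link, replicas):
    """Client sends one copy; the receiving server forwards the rest.

    The forwarding server has to push (replicas - 1) copies over the
    storage network, so that link can become the limit instead.
    """
    if replicas <= 1:
        return float(client_link)
    return min(float(client_link), storage_link / (replicas - 1))


# Hypothetical setup: GigE to the clients (~120 MBytes/s usable),
# ten times that on the storage VLAN.
for n in (2, 3, 4):
    print(n, client_side_write_rate(120, n),
          server_side_write_rate(120, 1200, n))
```

With these numbers the client-side rate drops to 60, 40 and 30 MBytes/s,
while the server-side rate stays at the full client link speed.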

Additional replicas will consume additional
bandwidth. But hey - who needs more than two replicas? BTW: The servers will
never talk to each other. It's always the client who transfers the data.

Unless you use server-side replicate, which is much more manageable and 
controllable in terms of bandwidth requirements. And trust me, > 2 
replicas is useful. I have seen both disks in a RAID1 mirror fail more
than once.

The perfect solution is probably a "distribute" over a "replicate". Mirror
the files over two bricks. Use your mirrors to build a large filesystem with
distribute. Your performance will scale with the number of bricks but you'll
keep the stability of a fully redundant setup.
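
The layout described above (bricks grouped into mirrored pairs, with files
spread across the pairs) can be sketched as follows. This is only an
illustration: GlusterFS's distribute translator uses its own hashing
scheme, and the plain CRC32-modulo placement here is a simplified stand-in.
The brick names and function names are hypothetical.

```python
import zlib


def build_replica_pairs(bricks):
    """Group bricks into consecutive mirrored pairs (the "replicate" layer)."""
    if len(bricks) % 2 != 0:
        raise ValueError("need an even number of bricks to form mirrors")
    return [(bricks[i], bricks[i + 1]) for i in range(0, len(bricks), 2)]


def place_file(path, pairs):
    """Pick the mirror that stores a file (the "distribute" layer).

    Stand-in hash: CRC32 of the path modulo the number of mirrors.
    """
    index = zlib.crc32(path.encode("utf-8")) % len(pairs)
    return pairs[index]


# Hypothetical brick names.
bricks = ["server1:/brick", "server2:/brick",
          "server3:/brick", "server4:/brick"]
pairs = build_replica_pairs(bricks)

for name in ("a.dat", "b.dat", "c.dat"):
    print(name, "->", place_file(name, pairs))
```

Every file ends up on exactly two bricks, and adding another pair adds
capacity and bandwidth without changing the replica count.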

Depends on your use case. Sometimes it is more useful to have all the 
data locally available for read-performance. But in that case write 
performance goes through the floor with that many replicas.
Broadcasting the writes only once would solve it in one fell swoop.
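
A back-of-the-envelope sketch of why. It only counts how many copies of the
data cross the writer's link, and it assumes an idealised broadcast or
multicast with no retransmissions and no acknowledgement overhead, which a
real implementation would have to deal with. The function name and the
numbers are illustrative.

```python
def write_seconds(file_mbytes, link_mbytes_per_s, replicas, broadcast=False):
    """Time for the writer to push one file out to all replicas.

    Unicast replicate sends one copy per replica over the writer's link;
    an idealised broadcast sends a single copy regardless of replica count.
    """
    copies_sent = 1 if broadcast else replicas
    return file_mbytes * copies_sent / link_mbytes_per_s


# Hypothetical case: 1200 MByte file, ~120 MBytes/s link, 10 replicas.
print(write_seconds(1200, 120, 10))                  # 100.0
print(write_seconds(1200, 120, 10, broadcast=True))  # 10.0
```

In this model the unicast write time grows linearly with the number of
replicas, while the broadcast write time does not change at all.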

Gordan

*specmanship, n: The art of misrepresenting capabilities of a device for
marketing purposes, typically by saying it will do X and Y when it
cannot in fact do X and Y at the same time.