On Wed, Feb 15, 2012 at 11:06:05PM +0100, Arnold Krille wrote:
> What was interesting is that pure-linux-nfs from
> node2 to node1 had roughly the same results as glusterfs on node2 to a single-
> brick volume on node1...

Yes, that's what I was hoping you'd see. There's nothing inherently
inefficient about the Gluster protocol, and the latency is mostly built up
from network round-trips. Even the userland FUSE client doesn't add much
additional latency. I would expect you'll find similar results with a
distributed volume, since the only difference is which node the request is
dispatched to.

> So the comparisons would be:
> 1. single local disk

Fastest.

> 2. pure nfs between the nodes
> 3.1 glusterfs (aka fuse-mount) with single-brick volumes across network
> 3.2 glusterfs with dual-brick/dual-node distributed volume

Those should be a bit slower than local disk but similar to each other.

> 3.3 glusterfs with dual-brick/dual-node replicated volume

That's where I think the speed difference will be significant. Writes have
to be committed to both nodes, and when you open a file for read it has to
check on both nodes to see if self-healing is required.

If that cost is too high for your application, then you could consider
allowing writes to just one node with some sort of 'catch-up' replication
to the other node: e.g. glusterfs geo-replication, or DRBD configured for
asynchronous mirroring (there's a rough sketch of the latter at the end of
this mail). The issue there is that in a failure situation, some committed
data may not have made it across to the mirror.

> Do you have more inputs for the test-regime?

Consider carefully what your application for this is, and try to make your
benchmarking tool implement the expected workload as closely as possible.
If there are different workloads then the final solution could use a mix of
technologies, or a mix of gluster volume types and/or underlying block
storage layouts. (I've also appended the commands for creating the three
gluster volume types, in case that helps with scripting the tests.)

If you have two disks in a node, have a look at Linux RAID10 with 'far'
layout. It means that a complete copy of the data is stored in the first
half of each disk, so that for read-heavy applications, head seeking is
reduced and you get the higher transfer rates from the outer cylinders.

  mdadm --create /dev/md/raid10 -n 2 -c 256 -l raid10 -p f2 -b internal \
    /dev/sda2 /dev/sdb2

(and tune the chunk size for your application too)

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

Of course, this gives redundancy within a node, so you can then choose not
to have real-time replication between nodes (or only do catch-up
replication) if that suits your needs.

Regards,

Brian.
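
P.S. Roughly what creating and mounting the three volume types looks like.
This is only a sketch, so check it against 'gluster help' before relying on
it; the volume names, brick paths and mount point are just placeholders for
whatever layout you end up with:

  # 3.1: single-brick volume on node1
  gluster volume create test-single node1:/export/brick-single
  gluster volume start test-single

  # 3.2: two-brick distributed volume across both nodes
  gluster volume create test-dist \
      node1:/export/brick-dist node2:/export/brick-dist
  gluster volume start test-dist

  # 3.3: two-brick replicated volume across both nodes
  gluster volume create test-repl replica 2 \
      node1:/export/brick-repl node2:/export/brick-repl
  gluster volume start test-repl

  # FUSE mount on node2, then point the benchmark at /mnt/test
  mount -t glusterfs node1:/test-single /mnt/test

The same mount command works for each volume (just change the volume name),
so it's easy to run an identical benchmark against all three.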
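
And the DRBD alternative I mentioned: asynchronous mirroring is DRBD's
"protocol A", where a write is confirmed once it's on the local disk and
handed to the network layer, and the peer catches up in the background.
A minimal resource definition looks something like the following; the host
names, backing devices and addresses here are made up, so adapt them to
your own setup:

  resource r0 {
    protocol A;                     # asynchronous replication
    on node1 {
      device    /dev/drbd0;
      disk      /dev/sda3;          # backing device to mirror
      address   192.168.0.1:7789;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   192.168.0.2:7789;
      meta-disk internal;
    }
  }

The trade-off is the one I described above: whatever is still in flight to
the peer at the moment of a failure is lost.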