On Wed, Feb 15, 2012 at 11:06:05PM +0100, Arnold Krille wrote:
> What was interesting is that pure-linux-nfs from
> node2 to node1 had roughly the same results as glusterfs on node2 to a single-
> brick volume on node1...

Yes, that's what I was hoping you'd see. There's nothing inherently
inefficient about the Gluster protocol, and the latency is mostly built up
from network round-trips. Even the userland FUSE client doesn't add much
additional latency. I would expect you'll find similar results with a
distributed volume, since the only difference is which node the request is
dispatched to.

> So the comparisons would be:
> 1. single local disk

Fastest.

> 2. pure nfs between the nodes
> 3.1 glusterfs (aka fuse-mount) with single-brick volumes across network
> 3.2 glusterfs with dual-brick/dual-node distributed volume

Those should be a bit slower than local disk but similar to each other.

> 3.3 glusterfs with dual-brick/dual-node replicated volume

That's where I think the speed difference will be significant. Writes have
to be committed to both nodes, and when you open a file for read it has to
check on both nodes to see if self-healing is required.

If that cost is too high for your application, then you could consider
allowing writes to just one node with some sort of 'catch-up' replication
to the other node: e.g. glusterfs geo-replication, or DRBD configured for
asynchronous mirroring (there's a rough sketch of the latter at the end of
this mail). The issue there is that in a failure situation, some committed
data may not have made it across to the mirror.

> Do you have more inputs for the test-regime?

Consider carefully what your application for this is, and try to make your
benchmarking tool implement the expected workload as closely as possible.
If there are different workloads then the final solution could use a mix of
technologies, or a mix of gluster volume types and/or underlying block
storage layouts. (I've also appended the commands for creating the three
gluster volume types, in case that helps with scripting the tests.)

If you have two disks in a node, have a look at Linux RAID10 with 'far'
layout. It means that a complete copy of the data is stored in the first
half of each disk, so that for read-heavy applications, head seeking is
reduced and you get the higher transfer rates from the outer cylinders.

  mdadm --create /dev/md/raid10 -n 2 -c 256 -l raid10 -p f2 -b internal \
    /dev/sda2 /dev/sdb2

(and tune the chunk size for your application too)

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

Of course, this gives redundancy within a node, so you can then choose not
to have real-time replication between nodes (or only do catch-up
replication) if that suits your needs.

Regards,

Brian.
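
P.S. Roughly what creating and mounting the three volume types looks like.
This is only a sketch, so check it against 'gluster help' before relying on
it; the volume names, brick paths and mount point are just placeholders for
whatever layout you end up with:

  # 3.1: single-brick volume on node1
  gluster volume create test-single node1:/export/brick-single
  gluster volume start test-single

  # 3.2: two-brick distributed volume across both nodes
  gluster volume create test-dist \
      node1:/export/brick-dist node2:/export/brick-dist
  gluster volume start test-dist

  # 3.3: two-brick replicated volume across both nodes
  gluster volume create test-repl replica 2 \
      node1:/export/brick-repl node2:/export/brick-repl
  gluster volume start test-repl

  # FUSE mount on node2, then point the benchmark at /mnt/test
  mount -t glusterfs node1:/test-single /mnt/test

The same mount command works for each volume (just change the volume name),
so it's easy to run an identical benchmark against all three.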
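
And the DRBD alternative I mentioned: asynchronous mirroring is DRBD's
"protocol A", where a write is confirmed once it's on the local disk and
handed to the network layer, and the peer catches up in the background.
A minimal resource definition looks something like the following; the host
names, backing devices and addresses here are made up, so adapt them to
your own setup:

  resource r0 {
    protocol A;                     # asynchronous replication
    on node1 {
      device    /dev/drbd0;
      disk      /dev/sda3;          # backing device to mirror
      address   192.168.0.1:7789;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   192.168.0.2:7789;
      meta-disk internal;
    }
  }

The trade-off is the one I described above: whatever is still in flight to
the peer at the moment of a failure is lost.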