Is there a FAQ/document somewhere with optimal mkfs and mount options for
ext4 and xfs? Is xfs still the 'desired' filesystem for gluster bricks?

On 3/15/12 3:22 AM, Brian Candler wrote:
> On Wed, Mar 14, 2012 at 11:09:28PM -0500, D. Dante Lorenso wrote:
>> get 50-60 MB/s transfer speeds tops when sending large files (> 2GB)
>> to gluster. When copying a directory of small files, we get <= 1
>> MB/s performance!
>>
>> My question is ... is this right? Is this what I should expect from
>> Gluster, or is there something we did wrong? We aren't using super
>> expensive equipment, granted, but I was really hoping for better
>> performance than this given that raw drive speeds using dd show that
>> we can write at 125+ MB/s to each "brick" 2TB disk.
> Unfortunately I don't have any experience with replicated volumes, but the
> raw glusterfs protocol is very fast: a single brick which is a 12-disk raid0
> stripe can give 500MB/sec easily over 10G ethernet without any tuning.
>
> I would expect a distributed volume to work fine too, as it just sends each
> request to one of N nodes.
>
> Striped volumes are unfortunately broken on top of XFS at the moment:
> http://oss.sgi.com/archives/xfs/2012-03/msg00161.html
>
> Replicated volumes, from what I've read, need to touch both servers even for
> read operations (for the self-healing functionality), and that could be a
> major bottleneck.
>
> But there are a few basic things to check:
>
> (1) Are you using XFS for the underlying filesystems? If so, did you mount
> them with the "inode64" mount option? Without this, XFS performance sucks
> really badly for filesystems >1TB.
>
> Without inode64, even untarring files into a single directory will make XFS
> distribute them between AGs, rather than allocating contiguous space for
> them.
>
> This is a major trip-up and there is currently talk of changing the default
> to be inode64.
>
> (2) I have this in /etc/rc.local:
>
>     for i in /sys/block/sd*/bdi/read_ahead_kb; do echo 1024 > "$i"; done
>     for i in /sys/block/sd*/queue/max_sectors_kb; do echo 1024 > "$i"; done
>
>> If I can't get gluster to work, our fail-back plan is to convert
>> these 8 servers into iSCSI targets and mount the storage onto a
>> Win2008 head and continue sharing to the network as before.
>> Personally, I would rather us continue moving toward CentOS 6.2 with
>> Samba and Gluster, but I can't justify the change unless I can
>> deliver the performance.
> Optimising replicated volumes I can't help with.
>
> However if you make a simple RAID10 array on each server, and then join the
> servers into a distributed gluster volume, I think it will rock. What you
> lose is the high availability, i.e. if one server fails, a proportion of
> your data becomes unavailable until you fix it - but that's no worse than
> your iSCSI proposal (unless you are doing something complex, like drbd
> replication between pairs of nodes and HA failover of the iSCSI target).
>
> BTW, Linux md RAID10 with 'far' layout is really cool; for reads it performs
> like a RAID0 stripe, and it reduces head seeking for random access.
>
> Regards,
>
> Brian.
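
To make the question concrete, here is roughly the recipe I have in mind
for a single XFS brick. This is only a sketch pieced together from the
advice above and things I've seen suggested elsewhere - the device
(/dev/sdb), the mount point, and the "-i size=512" inode-size option (which
I understand is often suggested so Gluster's extended attributes fit in the
inode) are assumptions on my part, not verified optimal settings:

    # hypothetical brick device and mount point
    mkfs.xfs -i size=512 /dev/sdb
    mkdir -p /export/brick1
    mount -o inode64,noatime /dev/sdb /export/brick1

    # /etc/fstab entry so inode64 survives a reboot
    /dev/sdb  /export/brick1  xfs  inode64,noatime  0 0

Happy to be corrected if there are better defaults for ext4 or xfs here.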
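
For completeness, a sketch of the md RAID10 'far' layout Brian describes,
assuming four data disks sdb-sde per server (device names are placeholders
and I haven't tested this layout myself):

    # 4-disk RAID10 with the far-2 layout; reads behave like a RAID0 stripe
    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.xfs -i size=512 /dev/md0
    mount -o inode64,noatime /dev/md0 /export/brick1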
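
And joining the eight servers into a plain distributed (non-replicated)
volume would presumably look something like this - hostnames
server1-server8 and the brick path are made up, and the syntax is from the
current 3.2-era CLI, so please check it against the docs:

    # run on server1; repeat the probe for server3..server8
    gluster peer probe server2
    gluster volume create dist-vol transport tcp \
        server1:/export/brick1 server2:/export/brick1 \
        server3:/export/brick1 server4:/export/brick1 \
        server5:/export/brick1 server6:/export/brick1 \
        server7:/export/brick1 server8:/export/brick1
    gluster volume start dist-vol

    # on a client
    mount -t glusterfs server1:/dist-vol /mnt/gluster

As Brian notes, this trades away high availability for performance, which
is the trade-off I'm trying to weigh against the iSCSI fallback.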