Re: 30 TB RAID6 + XFS slow write performance

On 7/19/2011 3:37 AM, Emmanuel Florac wrote:
> On Mon, 18 Jul 2011 14:58:55 -0500, you wrote:
> 
>> card: MegaRAID SAS 9260-16i
>> disks: 14x Barracuda® XT ST33000651AS 3TB (2 hot spares).
>> RAID6
>> ~ 30TB

> This card doesn't activate the write cache without a BBU present. Be
> sure you have a BBU or the performance will always be unbearably awful.

In addition to all the other recommendations, once the BBU is installed,
disable the individual drive caches (if this isn't done automatically),
and set the controller cache mode to 'write back'.  The write through
and direct I/O cache modes will deliver horrible RAID6 write performance.
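
A minimal sketch of those controller settings using LSI's MegaCli
utility might look like the following (the -LAll/-aAll selectors are
meant as "all logical drives on all adapters"; exact option spelling
varies between MegaCli versions, so treat this as a sketch rather than
a recipe):

    # Set the controller cache mode to write back on all logical drives
    MegaCli -LDSetProp WB -LAll -aAll

    # Disable the individual drive caches
    MegaCli -LDSetProp -DisDskCache -LAll -aAll

    # Use cached I/O rather than direct I/O
    MegaCli -LDSetProp Cached -LAll -aAll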

And, BTW, RAID6 is a horrible choice for a parallel, small file, high
random I/O workload such as you've described.  RAID10 would be much more
suitable.  Actually, any striped RAID is less than optimal for such a
small file workload.  The default stripe size for the LSI RAID
controllers, IIRC, is 64KB.  With 14 spindles of stripe width you end up
with 64KB * 14 = 896KB.  XFS will try to pack as many of these 50-150KB
files together as it can, but you're talking 6 to 18 files per full
stripe, and this is wholly dependent on the parallel write pattern and
on which allocation group XFS decides to write each file into.  XFS
isn't going to be 100% efficient in this case.  Thus, you will end up
with many partial stripe width writes, eliminating much of the
performance advantage of striping.

These are large 7200 rpm SATA drives, which have poor seek performance
to begin with, unlike the 'small' 300GB 15k SAS drives.  You're eating
into that already poor seek performance even further by:

1.  Using double parity striped RAID
2.  Writing thousands of small files in parallel

This workload is very similar to the case of a mail server using the
maildir storage format.  If you read the list archives you'll see
recommendations for an optimal storage stack setup for this workload.
It goes something like this:

1.  Create a linear array of hardware RAID1 mirror sets.
    Do this all in the controller if it can do it.
    If not, use Linux RAID (mdadm) to create a '--linear' array of the
    multiple (7 in your case, apparently) hardware RAID1 mirror sets
    (see the mdadm sketch after this list).

2.  Now let XFS handle the write parallelism.  Format the resulting
    7 spindle Linux RAID device with, for example:

    mkfs.xfs -d agcount=14 /dev/md0
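
For the mdadm step in 1 above, a minimal sketch might look like this,
assuming the seven hardware RAID1 mirror sets are exported to Linux as
/dev/sdb through /dev/sdh (device names here are hypothetical):

    # Concatenate the 7 mirror sets end-to-end, no striping
    mdadm --create /dev/md0 --level=linear --raid-devices=7 \
        /dev/sd[b-h]

The '--linear' level concatenates rather than stripes, so each file's
blocks land on a single mirror pair and XFS's allocation groups, not
the RAID layer, provide the write parallelism.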

By using this configuration you eliminate the excessive head seeking
associated with the partial stripe write problems of RAID6, restoring
performance efficiency to the array.  Using 14 allocation groups allows
XFS to write, at minimum, 14 such files in parallel.  This may not
seem like a lot given you have ~200 writers, but it's actually far more
than what you're getting now, or what you'll get with striped parity
RAID.  Consider the 150KB file case:  14 * 150KB = 2.1MB per batch of
parallel writes.  Assuming this hardware and software stack can sink
210MB/s with this workload, that's ~1400 files written per second, or
84,000 files per minute.  Would this be sufficient for your application?
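
A quick back-of-the-envelope check of that arithmetic (the 210MB/s
figure is an assumption, not a measurement):

    # 14 parallel writers * 150KB per file = 2.1MB per batch
    echo $(( 14 * 150 ))            # 2100 KB
    # 210MB/s divided by 150KB per file, then scaled to a minute
    echo $(( 210000 / 150 ))        # 1400 files/s
    echo $(( 210000 / 150 * 60 ))   # 84000 files/minute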

Now that we've covered the XFS and hardware RAID side of this equation,
does your application run directly on this machine, or are you
writing over NFS or CIFS to this XFS filesystem?  If it's the latter,
that's another fly in the ointment we may have to deal with.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs


