On Mon, Dec 05, 2011 at 01:50:58PM -0500, Paul Anderson wrote:
> I've set up a software RAID-60 array composed of 7 software RAID6s,
> each with 32k chunks, 18 devices total (16 data, 2 parity), and in
> theory appropriate setup parameters according to a nice white paper
> written by Christoph and presented this last summer at LinuxCon.
>
> My question is, if the mdraid and XFS are all configured properly,
> would I expect to see any read operations when doing a write-only
> test? I would have assumed that I would not, since XFS should write
> stripe-aligned sets of data, and in theory nothing needs to be read
> (no read-modify-write going on, I would think).

That depends. What's your "write only" test?

> The performance is great, but I'm wondering if I need to keep looking.

If performance is great, then what's the problem?

> Thanks,
>
> Paul Anderson
>
> Here's the details for kernel 2.6.38.5:
>
> mdadm --detail /dev/md0 (md1, md2, md3, md4, md5, and md6 all the same)
> /dev/md0: ....
>     Chunk Size : 32K
>
> /dev/md8 is the RAID0 that concatenates the above RAID6s, making a
> single RAID60:
>
> mdadm --detail /dev/md8
> /dev/md8: ....
>     Chunk Size : 4096K (this is what the RAID0 container thinks, but
>     I ignore it for xfs)

You should set the RAID0 chunk size to the stripe width of the
underlying RAID6 volumes, i.e. 16 data disks x 32k chunk = 512k.

> xfs_info /exports/
> meta-data=/dev/md8       isize=256    agcount=204, agsize=268435448 blks
>          =               sectsz=512   attr=2
> data     =               bsize=4096   blocks=54698370048, imaxpct=1
>          =               sunit=8      swidth=1024 blks

The reads you're seeing are there because XFS has clearly not been
configured correctly. You've given it a stripe unit of 32k (the RAID6
chunk size) and a stripe width of 4MB (the RAID0 chunk size). What you
are doing is aligning allocation to individual disks in the RAID6
volumes, but the filesystem doesn't know what the stripe width of
those volumes is, so it can't align correctly to the RAID6 geometry.
And because it is not set up with a sunit of 128 (512k), it can't
align to the RAID0 on top of them correctly, either.

You need to align all layers of the stack to each other so the
filesystem has a consistent view of stripe units and widths. In this
configuration, the RAID0 really needs a chunk size of 512k to match
the RAID6 stripe width. Then you can choose between two different
valid alignments for the filesystem - align to the underlying RAID6
volumes, or to the top-level RAID0.

If you have a small-file-intensive workload, then aligning to the
RAID6 is probably best, so that small files can pack full RAID6 stripe
widths. If you have a bandwidth-intensive workload, then aligning to
the RAID0 is probably best, so that large writes are aligned to the
full stripe width of the underlying RAID6 devices. Either way, you
need to understand and test your workload to improve on whatever the
default XFS settings give you.

> I made the filesystem like this:
> mkfs.xfs -L $(hostname) -l su=32768 -d su=32768,sw=128 /dev/md8
>
> mount options: inode64,largeio,swalloc,delaylog,logbsize=256k,logbufs=8,noatime,nodiratime

Why largeio,swalloc? Have you determined that you're actually getting
hot disks in your array without them? FWIW, delaylog and logbufs=8 are
the defaults, so you don't need to set them, and nodiratime is a
subset of noatime, so you don't need to specify that, either.

> I intended to make it with an external log, but forgot.

So you've determined that an internal log is a performance bottleneck
for your workload?
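As a sketch of what the aligned stack could look like - the device
names and exact mdadm invocations below are illustrative assumptions,
not taken from your setup, so adjust them to your disks and mdadm
version:

  # Seven 18-disk RAID6 legs with 32k chunks (example device names);
  # each leg's stripe width is 16 data disks x 32k = 512k.
  mdadm --create /dev/md0 --level=6 --raid-devices=18 --chunk=32 \
        /dev/sd[a-r]
  # ... and likewise for /dev/md1 through /dev/md6 ...

  # RAID0 across the seven legs, with its chunk size matched to the
  # 512k RAID6 stripe width:
  mdadm --create /dev/md8 --level=0 --raid-devices=7 --chunk=512 \
        /dev/md[0-6]

  # Alignment option 1 - align XFS to the RAID6 legs (small files):
  # su = the 32k RAID6 chunk, sw = the 16 data disks per leg.
  mkfs.xfs -L $(hostname) -d su=32k,sw=16 /dev/md8

  # Alignment option 2 - align XFS to the top-level RAID0 (bandwidth):
  # su = the 512k RAID0 chunk, sw = the 7 RAID6 legs.
  mkfs.xfs -L $(hostname) -d su=512k,sw=7 /dev/md8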
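And the corresponding mount line with the redundant options dropped
(keeping largeio,swalloc only until you've verified whether they
actually help your workload; /exports is assumed from the xfs_info
above):

  mount -o inode64,largeio,swalloc,logbsize=256k,noatime /dev/md8 /exports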
Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx