Re: How to deal with XFS stripe geometry mismatch with hardware RAID5

Brian Candler <B.Candler@xxxxxxxxx> · Wed, 14 Mar 2012 21:05:14 +0000

On Wed, Mar 14, 2012 at 10:43:44AM -0700, troby wrote:
> Mongo pre-allocates its datafiles and zero-fills them (there is a short
> header at the start of each, not rewritten as far as I know)  and then
> writes to them sequentially, wrapping around when it hits the end. In this
> case the entire load is inserts, no updates, hence the sequential writes.
> The data will not wrap around for about 6 months, at which time old files
> will be overwritten starting from the beginning. The BBU is functioning and
> the cache is set to write-back. The files are memory-mapped, I'll check
> whether fsync is used. Flushing is done about every 30 seconds and takes
> about 8 seconds.

How much data has been added to mongodb in those 30 seconds?

If everything really was being written sequentially then I reckon you could
write about 6.6GB in that time (11 disks x 75MB/sec x 8 seconds). From your
posting I suspect you are not achieving that level of performance :-)
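
For reference, the arithmetic behind that figure can be written out as a
small shell function. This is only a rough ceiling: the 11 disks, 75MB/sec
per disk and 8 seconds are the numbers above, and the function name is made
up for illustration.

```shell
# Rough upper bound on what could be written sequentially in one flush.
# $1 = number of disks, $2 = MB/sec per disk, $3 = flush time in seconds
seq_write_mb() {
    echo $(( $1 * $2 * $3 ))
}

seq_write_mb 11 75 8    # prints 6600 (MB), i.e. about 6.6GB
```

Comparing the amount mongodb actually added in 30 seconds against this
number gives a feel for how far from sequential the workload is.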

If it really is being written sequentially to a contiguous file then the
stripe alignment won't make any difference, because this is just a big
pre-allocated file, and XFS will do its best to give one big contiguous
chunk of space for it.

Anyway, you don't need to guess these things; you can easily find out.

(1) Is the file preallocated and contiguous, or fragmented?

    # xfs_bmap /path/to/file

This will show you if you get one huge extent. If you get a number of large
extents (say 100MB+) that would be fine for performance too.  If you get
lots of shrapnel then there's a problem.
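
If the file has too many extents to eyeball, something like the following
sketch can summarise them. It assumes the plain (non -v) xfs_bmap output
format, where the first line is the file name and each following line looks
like "0: [start..end]: start..end" in 512-byte blocks. The function name is
made up for illustration.

```shell
# Count the extents in plain xfs_bmap output read from stdin and report the
# smallest one. Offsets are assumed to be in 512-byte blocks.
extent_summary() {
    awk '
        NR == 1      { next }            # first line is the file name
        $3 == "hole" { next }            # holes are not extents
        {
            # $2 looks like "[0..4194303]:"; pull out the two numbers
            split($2, r, /[^0-9]+/)
            len = r[3] - r[2] + 1        # extent length in 512-byte blocks
            n++
            if (n == 1 || len < min) min = len
        }
        END { printf "%d extents, smallest %d MB\n", n, min * 512 / 1048576 }
    '
}
```

It would be used as:

    # xfs_bmap /path/to/file | extent_summary

A handful of extents with the smallest in the hundreds of MB is the good
case; thousands of extents with a tiny minimum is the shrapnel case.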

(2) Are you really writing sequentially?

    # btrace /dev/whatever | grep ' [DC] '

This will show you block requests dispatched to [D] and completed by [C]
the controller.
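
To put a number on "sequential", the btrace output can be run through a
small filter like the sketch below. It assumes the default blkparse line
layout (device, cpu, sequence, time, pid, action, rwbs, sector, "+", number
of blocks) and counts how many write dispatches start exactly where the
previous one ended. The function name is made up for illustration, and with
several writers interleaved the count will understate how sequential each
individual stream is.

```shell
# Read btrace/blkparse output on stdin and report how many dispatched
# writes begin at the sector where the previous dispatched write ended.
seq_ratio() {
    awk '
        $6 == "D" && $7 ~ /W/ && $9 == "+" {
            total++
            if (seen && $8 == next_sector) seq++
            next_sector = $8 + $10       # sector just past this request
            seen = 1
        }
        END { printf "%d of %d write dispatches continue the previous one\n", seq, total }
    '
}
```

It would be used as below, interrupted with ctrl-C after a flush has gone
by:

    # btrace /dev/whatever | seq_ratio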

And at a higher level:

    # strace -p <pid-of-mongodb-process>

will show you the seek/write/read operations that the application is
performing.

Once you have the answers to those, you can make a better judgement as to
what's happening.

(3) One other thing to check:

    # cat /sys/block/xxx/bdi/read_ahead_kb
    # cat /sys/block/xxx/queue/max_sectors_kb

Increasing those to 1024 (echo 1024 > ....) may give some improvement.
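
As a sketch, both settings can be changed in one go with a function like
the one below. The function name and the SYSFS_ROOT variable are made up
for illustration (SYSFS_ROOT only exists so the function can be tried
against a scratch directory first). It needs to run as root, the values do
not survive a reboot, and the kernel will refuse a max_sectors_kb larger
than the device's max_hw_sectors_kb.

```shell
# Show the current value of both settings for a device, then set them.
# $1 = device name as it appears under /sys/block, $2 = new value in KB
set_io_limits() {
    dev=$1
    kb=${2:-1024}
    root=${SYSFS_ROOT:-/sys/block}   # hypothetical override for dry runs
    for f in "$root/$dev/bdi/read_ahead_kb" "$root/$dev/queue/max_sectors_kb"; do
        printf '%s: %s -> ' "$f" "$(cat "$f")"
        echo "$kb" > "$f" && cat "$f"
    done
}
```

It would be used as:

    # set_io_limits xxx 1024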

> One thing I'm wondering is whether the incorrect stripe structure I
> specified with mkfs is actually written into the file system structure

My guess is that things like chunks of inodes are stripe-aligned.
But if you're really writing sequentially to a huge contiguous file then it
won't matter anyway.

Regards,

Brian.

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs