/dev/sdb1 /data xfs defaults,logdev=/dev/sda3,logbsize=256k,logbufs=8,largeio,nobarrier

meta-data=/dev/sdb1              isize=256    agcount=32, agsize=251772920 blks
         =                       sectsz=4096  attr=0
data     =                       bsize=4096   blocks=8056733408, imaxpct=2
         =                       sunit=2      swidth=24 blks, unwritten=1
naming   =version 2              bsize=4096
log      =external               bsize=4096   blocks=16000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Mongo pre-allocates its datafiles and zero-fills them (there is a short header at the start of each which, as far as I know, is not rewritten), then writes to them sequentially, wrapping around when it reaches the end. In this case the entire load is inserts, no updates, hence the sequential writes. The data will not wrap around for about 6 months, at which point the old files will start being overwritten from the beginning.

The BBU is functioning and the cache is set to write-back. The files are memory-mapped; I'll check whether fsync is used (the command I plan to use is at the bottom of this message). Flushing happens about every 30 seconds and takes about 8 seconds.

One thing I'm wondering is whether the incorrect stripe geometry I specified with mkfs is actually recorded in the on-disk filesystem structure, or is effectively just a hint to the kernel about what write size to use. If it is only a hint, could I specify the correct stripe width in the mount options and override the incorrect width from mkfs? (I've sketched the sort of mount command I mean at the end of this message.) Since the current average write size is only about half the specified stripe width, and since I'm not using md or xfs v.3, it seems the kernel is ignoring it for now.

Stan Hoeppner wrote:
>
> On 3/13/2012 6:21 PM, troby wrote:
>
>> Short of recreating the filesystem with the correct stripe width, would
>> it make sense to change the mount options to define a stripe width that
>> actually matches either the filesystem (11 stripe elements wide) or the
>> hardware (12 stripe elements wide)? Is there a danger of filesystem
>> corruption if I give fstab a mount geometry that doesn't match the values
>> used at filesystem creation time?
>
> What would make sense is for you to first show
>
> $ cat /etc/fstab
> $ xfs_info /dev/raid_device_name
>
> before we recommend any changes.
>
>> I'm unclear on the role of the RAID hardware cache in this. Since the
>> writes are sequential,
>
> This seems to be an assumption at odds with other information you've
> provided.
>
>> and since the volume of data written is such that it would
>> take about 3 minutes to actually fill the RAID cache,
>
> The PERC 700 operates in write-through cache mode if no BBU is present
> or the battery is degraded or has failed. You did not state whether
> your PERC 700 has the BBU installed. If not, you can increase write
> performance and decrease latency pretty substantially by adding the BBU
> which enables the write-back cache mode.
>
> You may want to check whether MongoDB uses fsync writes by default. If
> it does, and you don't have the BBU and write-back cache, this is
> affecting your write latency and throughput as well.
>
>> I would think the data
>> would be resident in the cache long enough to assemble a full-width
>> stripe at the hardware level and avoid the 4 I/O RAID5 penalty.
>
> Again, write-back-cache is only enabled with BBU on the PERC 700. Do
> note that achieving full stripe width writes is as much a function of
> your application workload and filesystem tuning as it is the RAID
> firmware, especially if the cache is in write-through mode, in which
> case the firmware can't do much, if anything, to maximize full width
> stripes.
>
> And keep in mind you won't hit the parity read-modify-write penalty on
> new stripe writes. This only happens when rewriting existing stripes.
> Your reported 50ms of latency for 100KB write IOs seems to suggest you
> don't have the BBU installed and you're actually doing RMW on existing
> stripes, not strictly new stripe writes. This is likely because...
>
> As an XFS filesystem gets full (you're at ~87%), file blocks may begin
> to be written into free space within existing partially occupied RAID
> stripes. This is where the RAID5/6 RMW penalty really kicks you in the
> a$$, especially if you have misaligned the filesystem geometry to the
> underlying RAID geometry.
>
> --
> Stan
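
In case it helps anyone reading the archive, this is the sort of thing I had in mind when asking about overriding the mkfs geometry at mount time. The numbers are purely illustrative (a 64 KiB stripe element across 12 data disks), not our real values, and I don't yet know whether the kernel accepts values that conflict with what mkfs recorded; that is really the question:

    # mount(8) takes XFS sunit/swidth in 512-byte sectors, so for a
    # hypothetical 64 KiB element across 12 data disks:
    #   sunit  = 65536 / 512 = 128
    #   swidth = 128 * 12    = 1536
    umount /data
    mount -t xfs -o logdev=/dev/sda3,logbsize=256k,logbufs=8,sunit=128,swidth=1536 /dev/sdb1 /data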
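
On the fsync question, I plan to confirm it by watching the mongod process for explicit flush calls, along these lines (adjust the pid lookup to taste):

    # Watch the memory-mapped datafile flushes for a while; msync/fsync/
    # fdatasync showing up here would mean Mongo forces the flush itself
    # rather than leaving it to kernel writeback.
    strace -f -tt -e trace=msync,fsync,fdatasync -p $(pidof mongod)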
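
Also, just to check that I follow the read-modify-write point, my understanding of the per-write cost on a RAID5 array with N data elements per stripe is:

    full-width, aligned stripe write:   N data writes + 1 parity write, no reads
    partial write into an existing
    stripe (read-modify-write):         read old data + read old parity,
                                        then write new data + new parity = 4 I/Os

which is why landing writes in partially occupied stripes hurts so much more than filling fresh ones.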