Re: Question regarding XFS on LVM over hardware RAID.

On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote:
> Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500:
> > On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
> > > On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
> > > > So, basically, --dataalignment is my friend during pvcreate and
> > > > lvcreate.
> > > 
> > > If the logical sector size reported by your RAID controller is 512
> > > bytes, then "--dataalignment=9216s" should start your data section on a
> > > RAID60 stripe boundary after the metadata section.
> > > 
> > > The PhysicalExtentSize should probably also match the 4608KB stripe
> > > width, but this is apparently not possible.  PhysicalExtentSize must be
> > > a power of 2 value.  I don't know if or how this will affect XFS aligned
> > > write out.  You'll need to consult with someone more knowledgeable of LVM.
> > 
> > You can't do single IOs of that size, anyway, so this is where the
> > BBWC on the raid controller does its magic and caches sequential IOs
> > until it has full stripe writes cached....
> 
> So I am probably missing something here, could you clarify?  Are you
> saying that I can't do single IOs of that size (by which I take your
> meaning to be IOs as small as 9216 sectors) because my RAID controller
> won't let me (i.e., it will cache anything smaller than the
> stripe size anyway)?

Typical limitations on IO size are the size of the hardware DMA
scatter-gather rings of the HBA/raid controller. For example, the
two hardware RAID controllers in my largest test box have
limitations of 70 and 80 segments and maximum IO sizes of 280k and
320k.
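If you want to see what limits the kernel has derived for your own
LUN, the block layer exposes them in sysfs (sdX below is just a
placeholder for whatever your RAID LUN is named):

  cat /sys/block/sdX/queue/max_segments       # scatter-gather segments
  cat /sys/block/sdX/queue/max_hw_sectors_kb  # hardware limit on a single IO
  cat /sys/block/sdX/queue/max_sectors_kb     # limit the kernel actually uses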

And looking at the IO being dispatched with blktrace, I see the
maximum size is:

  8,80   2       61     0.769857112 44866  D  WS 12423408 + 560 [qemu-system-x86]
  8,80   2       71     0.769877563 44866  D  WS 12423968 + 560 [qemu-system-x86]
  8,80   2       72     0.769889767 44866  D  WS 12424528 + 560 [qemu-system-x86]
                                                            ^^^

560 sectors or 280k. So for this hardware, sequential 280k writes
are hitting the BBWC. And because they are sequential, the BBWC is
writing them back as full stripe writes after aggregating them in
NVRAM. Hence there are no performance-diminishing RMW cycles
occurring, even though the individual IO size is much smaller than
the stripe unit/width....
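For reference, a trace like the one above can be captured along these
lines (the device name is a placeholder, not the one from this box):

  blktrace -d /dev/sdX -o - | blkparse -i -

The "D" actions in the output are the IOs actually dispatched to the
device, and the "+ N" field is the size in sectors.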

> Or are you saying that XFS with these given
> settings won't make writes that small (which seems false, since I'm
> essentially telling it to do writes of precisely that size).  I'm a bit
> unclear on that.

What su/sw tells XFS is how to align allocation of files, so that
when we dispatch sequential IO to that file it is aligned to the
underlying storage because the extents that the filesystem allocated
for it are aligned. This means that if you write exactly one stripe
width of data, it will hit each disk exactly once. It might take 10
IOs to get the data to the storage, but it will only hit each disk
once.
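As a sketch only: the values below assume a 512k chunk across 9 data
spindles, which multiplies out to the 4608k stripe width being
discussed here; substitute your real geometry and device name:

  mkfs.xfs -d su=512k,sw=9 /dev/vg0/lv0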

The function of the stripe cache (in software raid) and the BBWC (in
hardware RAID) is to prevent RMW cycles while the
filesystem/hardware is still flinging data at the RAID lun. Only
once the controller has complete stripe widths will it calculate
parity and write back the data, thereby avoiding a RMW cycle....

> In addition, does this in effect mean that when it comes to LVM, extent
> size makes no difference for alignment purposes?  So I don't have to
> worry about anything other that aligning the beginning and ending of
> logical volumes, volume groups, etc. to 9216 sector multiples?

No, you still have to align everything to the underlying storage so
that the filesystem on top of the volumes is correctly aligned.
Where the data will be written (i.e. how the filesystem allocates the
underlying blocks) determines the IO alignment of sequential/large
user IOs, and that matters far more than the size of the sequential
IOs the kernel uses to write the data.
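To make that concrete, a minimal sketch of the LVM side (9216 sectors
= 4608k, i.e. one stripe width; device and volume names are
placeholders):

  pvcreate --dataalignment 9216s /dev/sdX
  vgcreate vg0 /dev/sdX
  lvcreate -L 10T -n lv0 vg0
  pvs -o +pe_start /dev/sdX   # check the data area starts on a 4608k boundary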

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
