Re: read-modify-write occurring for direct I/O on RAID-5

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 7 Aug 2023 08:38:48 +1000

On Sun, Aug 06, 2023 at 11:21:38AM -0700, Corey Hickey wrote:
> On 2023-08-05 15:37, Dave Chinner wrote:
> > On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote:
> > > On 2023-08-04 14:52, Dave Chinner wrote:
> > > > On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> > > > > On 2023-08-04 01:07, Dave Chinner wrote:
> > > > > > If you want to force XFS to do stripe width aligned allocation for
> > > > > > large files to match with how MD exposes its topology to
> > > > > > filesystems, use the 'swalloc' mount option. The down side is that
> > > > > > you'll hotspot the first disk in the MD array....
> > > > > 
> > > > > If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> > > > > unaligned writes.
> > > > > 
> > > > > If I manually specify the (I think) correct values, I do still get writes
> > > > > aligned to sunit but not swidth, as before.
> > > > 
> > > > Hmmm, it should not be doing that - where is the misalignment
> > > > happening in the file? swalloc isn't widely used/tested, so there's
> > > > every chance there's something unexpected going on in the code...
> > > 
> > > I don't know how to tell the file position, but I wrote a one-liner for
> > > blktrace that may help. This should tell the position within the block
> > > device of writes enqueued.
> > 
> > xfs_bmap will tell you the file extent layout (offset to lba relationship).
> > (`xfs_bmap -vvp <file>` output is preferred if you are going to paste
> > it into an email.)
> 
> Ah, nice; the flags even show the alignment.
> 
> Here are the results for a filesystem on a 2-data-disk RAID-5 with 128 KB
> chunk size.

....

> $ sudo xfs_bmap -vvp /mnt/tmp/test.bin
> /mnt/tmp/test.bin:
>  EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET          TOTAL FLAGS
>    0: [0..7806975]:         512..7807487        0 (512..7807487)   7806976 000000
>    1: [7806976..15613951]:  7864576..15671551   1 (512..7807487)   7806976 000011
>    2: [15613952..20971519]: 15728640..21086207  2 (512..5358079)   5357568 000000

Thanks for that, I think it points out the problem quite clearly.
The stripe width allocation alignment looks to be working as
intended - the "AG-OFFSET" column has the same values in each extent
so within the AG address space everything is correctly "stripe
width" aligned.

What we see here is a mkfs.xfs "anti hotspot" behaviour with striped
layouts. That is, it automagically sizes the AGs such that each AG
header sits on a different stripe unit within the stripe so that the
AG headers don't end up all on the same physical stripe unit.

That results in the entire AG being aligned to the stripe unit
rather than the stripe width. And so when we do stripe width aligned
allocation within the AG, it assumes that the AG itself is stripe
width aligned, which it isn't....
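
The arithmetic can be checked directly from the xfs_bmap output quoted
above. This is only a rough sketch: it assumes xfs_bmap is reporting in
512-byte units and that the geometry is sunit = 128 KB and swidth =
2 x 128 KB, as described for this array. The names are illustrative.

```python
SECTOR = 512                      # assumed xfs_bmap reporting unit, in bytes
SUNIT = 128 * 1024 // SECTOR      # 128 KB chunk -> 256 sectors
SWIDTH = 2 * SUNIT                # 2 data disks -> 512 sectors

# (AG number, physical start sector, AG-relative start sector),
# copied from the BLOCK-RANGE and AG-OFFSET columns quoted above.
extents = [(0, 512, 512), (1, 7864576, 512), (2, 15728640, 512)]


def alignment(sector):
    """Report whether a sector address is sunit and/or swidth aligned."""
    return {"sunit": sector % SUNIT == 0, "swidth": sector % SWIDTH == 0}


def report(rows):
    """Derive each AG's start sector and check both addresses."""
    out = []
    for ag, phys, ag_off in rows:
        ag_start = phys - ag_off  # where the AG itself begins on the device
        out.append((ag, ag_start, alignment(ag_start), alignment(phys)))
    return out


for ag, ag_start, ag_al, ext_al in report(extents):
    print(f"AG {ag}: starts at {ag_start} {ag_al}, extent starts {ext_al}")
```

AG 1 starts on a stripe unit boundary but not a stripe width boundary,
so an extent at a swidth-aligned AG offset ends up half a stripe out
on disk.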

So, if you were to do something like this:

# mkfs.xfs -d agsize=1048576b ....

to force the AG size to be a multiple of stripe width, mkfs will
issue a warning that it is going to place all the AG headers on the
same stripe unit, but then go and do what you asked it to do.

That should work around the problem you are seeing. Meanwhile, I
suspect the swalloc mechanism might need a tweak to do physical LBA
alignment, not AG offset alignment....
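
As a sanity check on the suggested agsize, here is a small sketch of
the modulo arithmetic. It assumes a 4096-byte filesystem block size
(not stated in this thread) and infers the current AG size from where
AG 1 begins in the xfs_bmap output, so treat both as assumptions.

```python
FS_BLOCK = 4096                         # assumed filesystem block size, bytes
SUNIT_BLKS = 128 * 1024 // FS_BLOCK     # 128 KB chunk -> 32 fs blocks
SWIDTH_BLKS = 2 * SUNIT_BLKS            # 2 data disks -> 64 fs blocks

# AG 1 starts at 512-byte sector 7864576 - 512 per the xfs_bmap output,
# which implies this AG size in filesystem blocks (an inference).
current_agsize = (7864576 - 512) * 512 // FS_BLOCK
proposed_agsize = 1048576               # from -d agsize=1048576b


def remainders(agsize):
    """Return (agsize mod sunit, agsize mod swidth) in fs blocks."""
    return agsize % SUNIT_BLKS, agsize % SWIDTH_BLKS


print("current :", current_agsize, remainders(current_agsize))
print("proposed:", proposed_agsize, remainders(proposed_agsize))
```

Under those assumptions the current AG size is a multiple of sunit but
leaves half a stripe width over, while 1048576 blocks divides evenly
by both.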

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx