Re: read-modify-write occurring for direct I/O on RAID-5

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 5 Aug 2023 07:52:56 +1000

On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> On 2023-08-04 01:07, Dave Chinner wrote:
> > If you want to force XFS to do stripe width aligned allocation for
> > large files to match with how MD exposes its topology to
> > filesystems, use the 'swalloc' mount option. The downside is that
> > you'll hotspot the first disk in the MD array....
> 
> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> unaligned writes.
> 
> If I manually specify the (I think) correct values, I do still get writes
> aligned to sunit but not swidth, as before.

Hmmm, it should not be doing that - where is the misalignment
happening in the file? swalloc isn't widely used/tested, so there's
every chance there's something unexpected going on in the code...
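
One rough way to narrow it down: take the start of each extent's
BLOCK-RANGE from `xfs_bmap -v <file>` (reported in 512-byte units)
and test it against sunit and swidth in the same units. This is only
a sketch; the helper names and the sample start blocks are made up
for illustration, and 1024/2048 are the values from your mkfs command
line.

```shell
# is_aligned START UNIT: succeed if START is a whole multiple of UNIT.
# Both arguments must be in the same units (here: 512-byte sectors).
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

sunit=1024    # sectors, from the mkfs.xfs command line
swidth=2048   # sectors, from the mkfs.xfs command line

# classify START: report how a (hypothetical) extent start is aligned.
classify() {
    if is_aligned "$1" "$swidth"; then
        echo "swidth-aligned"
    elif is_aligned "$1" "$sunit"; then
        echo "sunit-aligned only"
    else
        echo "unaligned"
    fi
}

classify 4096   # multiple of 2048
classify 3072   # multiple of 1024 but not of 2048
classify 3000   # neither
```

Knowing which file offsets the "sunit-aligned only" extents start at
is the interesting part.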

> -----------------------------------------------------------------------
> $ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
> mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
> stripe width 546816
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10              isize=512    agcount=16, agsize=982912 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =                       bsize=4096   blocks=15726592, imaxpct=25
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=16384, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> $ sudo mount -o swalloc /dev/md10 /mnt/tmp
> -----------------------------------------------------------------------
> 
> There's probably something else I'm doing wrong there.

Looks sensible, but it's likely still tripping over some non-obvious
corner case in the allocation code. The allocation code is not
simple (allocation alone has roughly 20 parameters that determine
behaviour), especially with all the alignment setup stuff done
before we even get to the allocation code...
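
As a sanity check on the geometry, the numbers in the mkfs output
are at least consistent with each other. A small sketch of the
arithmetic, assuming the command-line sunit/swidth values are in
512-byte sectors and the filesystem block size is the 4096 bytes
shown above (the function name is illustrative only):

```shell
# sectors_to_fsb SECTORS BLOCKSIZE: convert 512-byte sectors to fs blocks.
sectors_to_fsb() {
    echo $(( $1 * 512 / $2 ))
}

# sunit in bytes: the value mkfs rejects as a log stripe unit
echo "sunit  bytes: $(( 1024 * 512 ))"
# sunit and swidth in 4096-byte filesystem blocks, as mkfs reports them
echo "sunit  blks:  $(sectors_to_fsb 1024 4096)"
echo "swidth blks:  $(sectors_to_fsb 2048 4096)"
```

That gives 524288 bytes, 128 blks and 256 blks, which matches what
mkfs printed.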

One thing to try is to set extent size hints for the directories
these large files are going to be written to. That takes a lot of
the allocation decisions away from the size/shape of the individual
IO and instead does large file offset aligned/sized allocations
which are much more likely to be stripe width aligned. e.g. set an
extent size hint of 16MB, and the first write into a hole will
allocate a 16MB chunk around the write instead of just the size that
covers the write IO.
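
A sketch of what that could look like. `xfs_io -c "extsize 16m" <dir>`
sets the hint on a directory so that new files created in it inherit
it; the wrapper name and the example path in the note below are just
placeholders. The second function only illustrates the rounding
described above. It is the idea, not the actual allocator code.

```shell
# set_extsize_hint DIR SIZE: set an inherited extent size hint on DIR.
# Only defined here, not run; it needs an XFS filesystem and xfs_io(8).
set_extsize_hint() {
    xfs_io -c "extsize $2" "$1"
}

# hinted_range OFFSET LENGTH HINT: print the hint-aligned byte range
# (start and end) that covers a write of LENGTH bytes at OFFSET.
hinted_range() {
    start=$(( $1 / $3 * $3 ))
    end=$(( ($1 + $2 + $3 - 1) / $3 * $3 ))
    echo "$start $end"
}

hint=$(( 16 * 1024 * 1024 ))

# A 1MiB write at offset 20MiB falls in the 16MiB..32MiB chunk.
hinted_range $(( 20 * 1024 * 1024 )) $(( 1024 * 1024 )) "$hint"
```

Usage would be something like `set_extsize_hint /mnt/tmp/bigfiles 16m`
before any files are created in that directory. 16MiB is a whole
multiple of the 1MiB stripe width you specified (2048 sectors).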

> Still, I'll heed your advice about not making a hotspot disk and allow XFS
> to allocate as default.
> 
> Now that I understand that XFS is behaving as intended and I can't/shouldn't
> necessarily aim for further alignment, I'll try recreating my real RAID,
> trust in buffered writes and the MD stripe cache, and see how that goes.

Buffered writes won't guarantee you alignment, either. In fact,
buffered IO is much more likely to do weird stuff than direct IO. If your
filesystem is empty, then buffered writes can look *really good*,
but once the filesystem starts being used and has lots of
discontiguous free space or the system is busy enough that writeback
can't lock contiguous ranges of pages, writeback IO will look a
whole lot less pretty and you have little control over what
it does....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx