On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
> On 2023-08-04 01:07, Dave Chinner wrote:
> > If you want to force XFS to do stripe width aligned allocation for
> > large files to match with how MD exposes its topology to
> > filesystems, use the 'swalloc' mount option. The down side is that
> > you'll hotspot the first disk in the MD array....
>
> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any
> unaligned writes.
>
> If I manually specify the (I think) correct values, I do still get writes
> aligned to sunit but not swidth, as before.

Hmmm, it should not be doing that - where is the misalignment
happening in the file? swalloc isn't widely used/tested, so there's
every chance there's something unexpected going on in the code...

> -----------------------------------------------------------------------
> $ sudo mkfs.xfs -f -d sunit=1024,swidth=2048 /dev/md10
> mkfs.xfs: Specified data stripe width 2048 is not the same as the volume
> stripe width 546816
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10              isize=512    agcount=16, agsize=982912 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =                       bsize=4096   blocks=15726592, imaxpct=25
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=16384, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> $ sudo mount -o swalloc /dev/md10 /mnt/tmp
> -----------------------------------------------------------------------
>
> There's probably something else I'm doing wrong there.

Looks sensible, but it's likely still tripping over some non-obvious
corner case in the allocation code.
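
As a quick sanity check on the units in the mkfs output above, here is
a small shell sketch. It is only an illustration: the helper names
sectors_to_blocks and is_aligned are made up for this example, and the
start block 384 is a hypothetical value, not something measured on
this array. It relies on mkfs.xfs taking -d sunit/swidth in 512-byte
sectors and reporting them in filesystem blocks (bsize=4096 here).

```shell
#!/bin/sh
# Sketch only: helper names and the sample start block are hypothetical.

# Convert a value given in 512-byte sectors (as passed to
# mkfs.xfs -d sunit=/swidth=) into filesystem blocks.
# $1 = value in 512-byte sectors, $2 = filesystem block size in bytes
sectors_to_blocks() {
    echo $(( $1 * 512 / $2 ))
}

# Succeed (exit status 0) if a start block is a multiple of the given
# alignment unit; both arguments are in filesystem blocks.
# $1 = start block, $2 = alignment unit
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

sunit_blks=$(sectors_to_blocks 1024 4096)
swidth_blks=$(sectors_to_blocks 2048 4096)
echo "sunit=$sunit_blks swidth=$swidth_blks blks"   # sunit=128 swidth=256 blks

# Hypothetical extent start block, chosen to show the case of
# "aligned to sunit but not swidth".
start=384
for unit in "$sunit_blks" "$swidth_blks"; do
    if is_aligned "$start" "$unit"; then
        echo "block $start is aligned to $unit"
    else
        echo "block $start is NOT aligned to $unit"
    fi
done
```

The same check can be applied by hand to the start blocks of real
extents to see where in a file the swidth alignment is lost.
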
The allocation code is not simple (allocation alone has roughly 20
parameters that determine behaviour), especially with all the
alignment setup stuff done before we even get to the allocation
code...

One thing to try is to set extent size hints for the directories
these large files are going to be written to. That takes a lot of
the allocation decisions away from the size/shape of the individual
IO and instead does large file offset aligned/sized allocations
which are much more likely to be stripe width aligned. e.g. set an
extent size hint of 16MB, and the first write into a hole will
allocate a 16MB chunk around the write instead of just the size that
covers the write IO.

> Still, I'll heed your advice about not making a hotspot disk and allow XFS
> to allocate as default.
>
> Now that I understand that XFS is behaving as intended and I can't/shouldn't
> necessarily aim for further alignment, I'll try recreating my real RAID,
> trust in buffered writes and the MD stripe cache, and see how that goes.

Buffered writes won't guarantee you alignment, either. In fact,
they're much more likely to do weird stuff than direct IO. If your
filesystem is empty, then buffered writes can look *really good*,
but once the filesystem starts being used and has lots of
discontiguous free space or the system is busy enough that writeback
can't lock contiguous ranges of pages, writeback IO will look a
whole lot less pretty and you have little control over what it
does....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx