On 2023-08-04 14:52, Dave Chinner wrote:
> On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote:
>> On 2023-08-04 01:07, Dave Chinner wrote:
>>> If you want to force XFS to do stripe width aligned allocation for
>>> large files to match with how MD exposes its topology to
>>> filesystems, use the 'swalloc' mount option. The downside is that
>>> you'll hotspot the first disk in the MD array....
>>
>> If I use 'swalloc' with the autodetected (wrong) swidth, I don't see
>> any unaligned writes.
>>
>> If I manually specify the (I think) correct values, I do still get
>> writes aligned to sunit but not swidth, as before.
>
> Hmmm, it should not be doing that - where is the misalignment
> happening in the file? swalloc isn't widely used/tested, so there's
> every chance there's something unexpected going on in the code...

I don't know how to tell the file position, but I wrote an awk filter
for blktrace output that may help. It should show the position, within
the block device, at which writes are enqueued.
Every time the alignment _changes_, the awk program prints:

* the previous line (if it exists and was not already printed)
* the current line

Each line from blktrace is prefixed with:

* a 'c' (current line) or 'p' (previous line), for debugging the awk
  program
* the offset from a 2048-sector alignment
* a '--' as a separator
I have manually inserted blank lines into the output to visually
separate it into three sections:

1. writes predominantly stripe-aligned
2. writes predominantly offset by one chunk
3. writes predominantly stripe-aligned again
-----------------------------------------------------------------------
$ sudo blktrace -d /dev/md10 -o - | blkparse -i - | awk '
    BEGIN { prev = ""; prev_offset = -1 }
    / Q / {
        offset = $8 % 2048
        if (offset != prev_offset) {
            if (prev) {
                printf("p %4d -- %s\n", prev_offset, prev)
                prev = ""
            }
            printf("c %4d -- %s\n", offset, $0)
            prev_offset = offset
            fflush()
        } else {
            prev = $0
        }
    }'
c 32 -- 9,10 11 1 0.000000000 213852 Q RM 32 + 8 [dd]
c 24 -- 9,10 11 2 0.000253462 213852 Q RM 24 + 8 [dd]
c 1024 -- 9,10 11 3 0.000434115 213852 Q RM 1024 + 32 [dd]
c 3 -- 9,10 11 4 0.001008057 213852 Q RM 3 + 1 [dd]
c 16 -- 9,10 11 5 0.001165978 213852 Q RM 16 + 8 [dd]
c 8 -- 9,10 11 6 0.001328206 213852 Q RM 8 + 8 [dd]
c 0 -- 9,10 11 7 0.001496647 213852 Q WS 2048 + 2048 [dd]
p 0 -- 9,10 1 469 10.544416303 213852 Q WS 6301696 + 2048 [dd]
c 128 -- 9,10 1 471 10.545831615 213789 Q FWFSM 62906496 + 64 [kworker/1:3]
c 0 -- 9,10 1 472 10.548127201 213852 Q WS 6303744 + 2048 [dd]
p 0 -- 9,10 0 5791 13.109985396 213852 Q WS 7804928 + 2048 [dd]

c 1027 -- 9,10 0 5793 13.113192558 213852 Q RM 7863299 + 1 [dd]
c 1040 -- 9,10 0 5794 13.136165405 213852 Q RM 7863312 + 8 [dd]
c 1032 -- 9,10 0 5795 13.136458182 213852 Q RM 7863304 + 8 [dd]
c 1024 -- 9,10 0 5796 13.136568992 213852 Q WS 7865344 + 2048 [dd]
p 1024 -- 9,10 1 2818 41.250430374 213852 Q WS 12133376 + 2048 [dd]
c 192 -- 9,10 1 2820 41.266187726 213789 Q FWFSM 62906560 + 64 [kworker/1:3]
c 1024 -- 9,10 1 2821 41.275578120 213852 Q WS 12135424 + 2048 [dd]
c 2 -- 9,10 5 1 41.266226029 213819 Q WM 2 + 1 [xfsaild/md10]
c 24 -- 9,10 5 2 41.266236639 213819 Q WM 24 + 8 [xfsaild/md10]
c 32 -- 9,10 5 3 41.266242160 213819 Q WM 32 + 8 [xfsaild/md10]
c 1024 -- 9,10 5 4 41.266246318 213819 Q WM 1024 + 32 [xfsaild/md10]
p 1024 -- 9,10 1 2823 41.308444405 213852 Q WS 12137472 + 2048 [dd]
c 256 -- 9,10 10 706 41.322338854 207685 Q FWFSM 62906624 + 64 [kworker/u64:11]
c 1024 -- 9,10 1 2825 41.334778677 213852 Q WS 12139520 + 2048 [dd]
p 1024 -- 9,10 3 3739 64.424114908 213852 Q WS 15668224 + 2048 [dd]

c 3 -- 9,10 3 3741 64.445830212 213852 Q RM 15726595 + 1 [dd]
c 16 -- 9,10 3 3742 64.455104423 213852 Q RM 15726608 + 8 [dd]
c 8 -- 9,10 3 3743 64.463494822 213852 Q RM 15726600 + 8 [dd]
c 0 -- 9,10 3 3744 64.470414156 213852 Q WS 15728640 + 2048 [dd]
p 0 -- 9,10 1 6911 71.983449607 213852 Q WS 20101120 + 2048 [dd]
c 320 -- 9,10 1 6913 71.985823522 213789 Q FWFSM 62906688 + 64 [kworker/1:3]
c 0 -- 9,10 1 6914 71.987115410 213852 Q WS 20103168 + 2048 [dd]
c 1 -- 9,10 5 6 71.985857777 213819 Q WM 1 + 1 [xfsaild/md10]
c 8 -- 9,10 5 7 71.985869209 213819 Q WM 8 + 8 [xfsaild/md10]
c 16 -- 9,10 5 8 71.985874249 213819 Q WM 16 + 8 [xfsaild/md10]
c 0 -- 9,10 1 6916 72.002414341 213852 Q WS 20105216 + 2048 [dd]
p 0 -- 9,10 1 6924 72.041196270 213852 Q WS 20113408 + 2048 [dd]
c 384 -- 9,10 4 1 72.041820949 211757 Q FWFSM 62906752 + 64 [kworker/u64:1]
c 0 -- 9,10 1 6926 72.043596586 213852 Q WS 20115456 + 2048 [dd]
-----------------------------------------------------------------------
I don't know if that's quite what you wanted, but hopefully it helps.
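In case it is useful for pinning down the file side of that question:
if I'm reading xfs_bmap(8) right, its -v output lists each extent's
file offset and on-disk block range in 512-byte units, which should
show whether extents start on sunit or swidth boundaries. Something
like this, with a hypothetical file name:

# FILE-OFFSET and BLOCK-RANGE are printed in 512-byte units
$ sudo xfs_bmap -v /mnt/scratch/testfile
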
> One thing to try is to set extent size hints for the directories
> these large files are going to be written to. That takes a lot of
> the allocation decisions away from the size/shape of the individual
> IO and instead does large file offset aligned/sized allocations
> which are much more likely to be stripe width aligned. e.g. set an
> extent size hint of 16MB, and the first write into a hole will
> allocate a 16MB chunk around the write instead of just the size that
> covers the write IO.

Can you please give me a documentation pointer for that? I wasn't able
to find the right thing by searching. I see some references to size
hints in mkfs.xfs, but it sounds like you're referring to something set
on specific directories at run time.
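The closest candidate I've found since writing the above is the
'extsize' command in xfs_io(8). If that's what you mean, I'd guess the
usage is something like this, with a hypothetical directory path:

# set a 16MB extent size hint on the directory; files created in it
# afterward should inherit the hint (existing files are unaffected)
$ sudo xfs_io -c 'extsize 16m' /mnt/scratch/media

If that's not it, a pointer would still be appreciated.
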
>> Still, I'll heed your advice about not making a hotspot disk and
>> allow XFS to allocate as default.
>>
>> Now that I understand that XFS is behaving as intended and I
>> can't/shouldn't necessarily aim for further alignment, I'll try
>> recreating my real RAID, trust in buffered writes and the MD stripe
>> cache, and see how that goes.
>
> Buffered writes won't guarantee you alignment, either. In fact,
> buffered IO is much more likely to do weird stuff than direct IO. If
> your filesystem is empty, then buffered writes can look *really
> good*, but once the filesystem starts being used and has lots of
> discontiguous free space, or the system is busy enough that writeback
> can't lock contiguous ranges of pages, writeback IO will look a whole
> lot less pretty and you have little control over what it does....

I'll keep that in mind. This filesystem doesn't get extensive writes
except when restoring from backup. That is why I started looking at
alignment, though--restoring from backup onto a new array with new
disks was incurring lots of RMW, reads were very delayed, and the
kernel was warning about hung tasks.
It probably didn't help that my RAID-5 was degraded due to a failed
disk I had to return. I audited my alignment choices anyway and found
some things I could do better, but I got stuck on XFS, hence this
thread.
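Given your point about writeback, one thing I may try on the next big
restore is direct IO in stripe-width multiples, so writeback never gets
a chance to fragment the writes. A rough sketch, with hypothetical
paths and a block size chosen as a multiple of the 256KiB stripe width
blkid reports below:

# oflag=direct bypasses the page cache, so writes are submitted in
# aligned 16MiB chunks; conv=fsync flushes once at the end
$ sudo dd if=/backup/file.img of=/mnt/restore/file.img bs=16M \
      oflag=direct conv=fsync status=progress
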
My intended full stack is:
* RAID-5
* bcache (default settings--writethrough)
* dm-crypt
* XFS
...and I've operated that stack before without noticing anything this
bad.
The alignment gets tricky, especially because bcache has a fixed
default data offset and doesn't quite propagate the topology of the
underlying backing device.
$ sudo blkid -i /dev/md5
/dev/md5: MINIMUM_IO_SIZE="131072" OPTIMAL_IO_SIZE="262144" PHYSICAL_SECTOR_SIZE="4096" LOGICAL_SECTOR_SIZE="512"
$ sudo blkid -i /dev/bcache0
/dev/bcache0: MINIMUM_IO_SIZE="512" OPTIMAL_IO_SIZE="262144" PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
Some of that makes sense for a writeback scenario, but I think for
writethrough I want to align to the topology of the underlying
backing device.
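If I rebuild the stack, it looks like the backing-device data offset
can at least be chosen at format time. A sketch, assuming I have the
option name right (--data-offset, in 512-byte sectors) and that 1MiB is
a sensible stripe-width multiple:

# 2048 sectors = 1MiB, a multiple of the 512-sector (256KiB) stripe
# width md5 reports, so the bcache data area (and the filesystem on
# it) starts stripe-aligned instead of at the small fixed default
$ sudo make-bcache -B /dev/md5 --data-offset 2048

Whatever topology dm-crypt then exposes on top could still be
corrected with mkfs.xfs's su=/sw= options if needed.
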
Thanks again for all your time.
-Corey