On Fri, Aug 04, 2023 at 06:44:47PM -0700, Corey Hickey wrote: > On 2023-08-04 14:52, Dave Chinner wrote: > > On Fri, Aug 04, 2023 at 12:26:22PM -0700, Corey Hickey wrote: > > > On 2023-08-04 01:07, Dave Chinner wrote: > > > > If you want to force XFS to do stripe width aligned allocation for > > > > large files to match with how MD exposes it's topology to > > > > filesytsems, use the 'swalloc' mount option. The down side is that > > > > you'll hotspot the first disk in the MD array.... > > > > > > If I use 'swalloc' with the autodetected (wrong) swidth, I don't see any > > > unaligned writes. > > > > > > If I manually specify the (I think) correct values, I do still get writes > > > aligned to sunit but not swidth, as before. > > > > Hmmm, it should not be doing that - where is the misalignment > > happening in the file? swalloc isn't widely used/tested, so there's > > every chance there's something unexpected going on in the code... > > I don't know how to tell the file position, but I wrote a one-liner for > blktrace that may help. This should tell the position within the block > device of writes enqueued. xfs_bmap will tell you the file extent layout (offset to lba relationship). (`xfs_bmap -vvp <file>` output is prefered if you are going to paste it into an email.) > For every time the alignment _changes_, the awk program prints: > * the previous line (if it exists and was not already printed) > * the current line > > Lines from blktrace are prefixed by: > * a 'c' or 'p' for debugging the awk program > * the offset from a 2048-sector alignment > * a '--' as a separator > > I have manually inserted blank lines into the output in order to > visually separate into three sections: > 1. writes predominantly stripe-aligned > 2. writes predominantly offset by one chunk > 3. writes predominantly stripe-aligned again > > ----------------------------------------------------------------------- > $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | awk 'BEGIN { prev=""; prev_offset=-1; } / Q / { offset=$8 % 2048; if (offset != prev_offset) { if (prev) { printf("p %4d -- %s\n", prev_offset, prev); prev="" }; printf("c %4d -- %s\n", offset, $0); prev_offset=offset; fflush(); } else { prev=$0 }} ' > c 32 -- 9,10 11 1 0.000000000 213852 Q RM 32 + 8 [dd] > c 24 -- 9,10 11 2 0.000253462 213852 Q RM 24 + 8 [dd] inobt + finobt metadata reads. > c 1024 -- 9,10 11 3 0.000434115 213852 Q RM 1024 + 32 [dd] Inode cluster read. > c 3 -- 9,10 11 4 0.001008057 213852 Q RM 3 + 1 [dd] AGFL read. > c 16 -- 9,10 11 5 0.001165978 213852 Q RM 16 + 8 [dd] > c 8 -- 9,10 11 6 0.001328206 213852 Q RM 8 + 8 [dd] AG freespace btree block reads. <inode now allocated> > c 0 -- 9,10 11 7 0.001496647 213852 Q WS 2048 + 2048 [dd] Data writes. > p 0 -- 9,10 1 469 10.544416303 213852 Q WS 6301696 + 2048 [dd] > c 128 -- 9,10 1 471 10.545831615 213789 Q FWFSM 62906496 + 64 [kworker/1:3] > c 0 -- 9,10 1 472 10.548127201 213852 Q WS 6303744 + 2048 [dd] Seek for journal IO between two sequential, contiguous data writes. > p 0 -- 9,10 0 5791 13.109985396 213852 Q WS 7804928 + 2048 [dd] > c 1027 -- 9,10 0 5793 13.113192558 213852 Q RM 7863299 + 1 [dd] > c 1040 -- 9,10 0 5794 13.136165405 213852 Q RM 7863312 + 8 [dd] > c 1032 -- 9,10 0 5795 13.136458182 213852 Q RM 7863304 + 8 [dd] Data write at tail end of AG, followed by read of the AGF and AG freespace btree blocks in next AG... > c 1024 -- 9,10 0 5796 13.136568992 213852 Q WS 7865344 + 2048 [dd] ... And the data write continues but I don;t think that is aligned. $ echo $(((7865344 / 2048) * 2048)) 7864320 $ Yeah, so if that was aligned, it would start at LBA 7864320, not 7865344. > p 1024 -- 9,10 1 2818 41.250430374 213852 Q WS 12133376 + 2048 [dd] > c 192 -- 9,10 1 2820 41.266187726 213789 Q FWFSM 62906560 + 64 [kworker/1:3] > c 1024 -- 9,10 1 2821 41.275578120 213852 Q WS 12135424 + 2048 [dd] Journal IO breaking up two unaligned contiguous data writes. > c 2 -- 9,10 5 1 41.266226029 213819 Q WM 2 + 1 [xfsaild/md10] > c 24 -- 9,10 5 2 41.266236639 213819 Q WM 24 + 8 [xfsaild/md10] > c 32 -- 9,10 5 3 41.266242160 213819 Q WM 32 + 8 [xfsaild/md10] > c 1024 -- 9,10 5 4 41.266246318 213819 Q WM 1024 + 32 [xfsaild/md10] Metadata writeback of AGI 0, inobt, finobt and inode cluster blocks. > p 1024 -- 9,10 1 2823 41.308444405 213852 Q WS 12137472 + 2048 [dd] > c 256 -- 9,10 10 706 41.322338854 207685 Q FWFSM 62906624 + 64 [kworker/u64:11] > c 1024 -- 9,10 1 2825 41.334778677 213852 Q WS 12139520 + 2048 [dd] Journal IO. > p 1024 -- 9,10 3 3739 64.424114908 213852 Q WS 15668224 + 2048 [dd] > c 3 -- 9,10 3 3741 64.445830212 213852 Q RM 15726595 + 1 [dd] > c 16 -- 9,10 3 3742 64.455104423 213852 Q RM 15726608 + 8 [dd] > c 8 -- 9,10 3 3743 64.463494822 213852 Q RM 15726600 + 8 [dd] Next AG. So the entire AG was written unaligned - that is expected because this is appending and that aims for contiguous allocation, not aligned allocation. > c 0 -- 9,10 3 3744 64.470414156 213852 Q WS 15728640 + 2048 [dd] And the first allocation in the next AG is properly aligned. Ok. SO it appears that something is not working 100% w.r.t. aligned allocation on the transition from one AG to the next. I wonder if we've failed the "at EOF" allocation because there isn't space in the AG and then done an "any AG" unaligned allocation as the fallback? I'll have to see if I can replicate this now I know that it appears to be the full AG -> first allocation in next AG fallback that appears to be going astray.... > > One thing to try is to set extent size hints for the directories > > these large files are going to be written to. That takes a lot of > > the allocation decisions away from the size/shape of the individual > > IO and instead does large file offset aligned/sized allocations > > which are much more likely to be stripe width aligned. e.g. set a > > extent size hint of 16MB, and the first write into a hole will > > allocate a 16MB chunk around the write instead of just the size that > > covers the write IO. > > Can you please give me a documentation pointer for that? I wasn't able > to find the right thing via searching. $ man 2 ioctl_xfs_fsgetxattr .... fsx_extsize is the preferred extent allocation size for data blocks mapped to this file, in units of filesystem blocks. If this value is zero, the filesystem will choose a default option, which is currently zero. If XFS_IOC_FSSETXATTR is called with XFS_XFLAG_EXTSIZE set in fsx_xflags and this field set to zero, the XFLAG will also be cleared. .... XFS_XFLAG_EXTSIZE Extent size bit - if a basic extent size value is set on the file then the allocator will allocate in multiples of the set size for this file (see fsx_extsize below). The extent size can only be changed on a file when it has no allocated extents. .... $ man xfs_io .... extsize [ -R | -D ] [ value ] Display and/or modify the preferred extent size used when allocating space for the currently open file. If the -R option is specified, a recursive descent is performed for all directory entries below the currently open file (-D can be used to restrict the output to directories only). If the target file is a directory, then the inherited extent size is set for that directory (new files created in that directory inherit that extent size). The value should be specified in bytes, or using one of the usual units suffixes (k, m, g, b, etc). The extent size is always reported in units of bytes. .... $ man mkfs.xfs .... extszinherit=value All inodes created by mkfs.xfs will have this value extent size hint applied. The value must be provided in units of filesystem blocks. Directories will pass on this hint to newly created regular files and directories. .... > I see some references to size hints in mkfs.xfs, but it seems like you > refer to something to be set for specific directories at run-time. It's the same thing, just set up different ways. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx