On Thu, Aug 03, 2023 at 10:44:31PM -0700, Corey Hickey wrote:
> Hello,
>
> I am having a problem with write performance via direct I/O. My setup is:
> * Debian Sid
> * Linux 6.3.0-2 (Debian Kernel)
> * 3-disk MD RAID-5 of hard disks
> * XFS
>
> When I do large sequential writes via direct I/O, sometimes the writes are
> fast, but sometimes the RAID ends up doing RMW and performance gets slow.
>
> If I use regular buffered I/O, then performance is better, presumably due to
> the MD stripe cache. I could just use buffered writes, of course, but I am
> really trying to make sure I get the alignment correct to start with.
>
>
> I can reproduce the problem on a fresh RAID.
> -----------------------------------------------------------------------
> $ sudo mdadm --create /dev/md10 -n 3 -l 5 -z 30G /dev/sd[ghi]
> mdadm: largest drive (/dev/sdg) exceeds size (31457280K) by more than 1%
> Continue creating array? y
> mdadm: Defaulting to version 1.2 metadata
> mdadm: array /dev/md10 started.
> -----------------------------------------------------------------------
> For testing, I'm using "-z 30G" to limit the duration of the initial RAID
> resync.
>
>
> For XFS I can use default options:
> -----------------------------------------------------------------------
> $ sudo mkfs.xfs /dev/md10
> log stripe unit (524288 bytes) is too large (maximum is 256KiB)
> log stripe unit adjusted to 32KiB
> meta-data=/dev/md10              isize=512    agcount=16, agsize=983040 blks

So an AG size of just under 2GB.

>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
> data     =                       bsize=4096   blocks=15728640, imaxpct=25
>          =                       sunit=128    swidth=68352 blks
                                                ^^^^^^^^^^^^^^^^^

Something is badly broken in MD land.

.....

> The default chunk size is 512K
> -----------------------------------------------------------------------
> $ sudo mdadm --detail /dev/md10 | grep Chunk
>        Chunk Size : 512K
> $ sudo blkid -i /dev/md10
> /dev/md10: MINIMUM_IO_SIZE="524288" OPTIMAL_IO_SIZE="279969792"
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Yup, that's definitely broken.

> PHYSICAL_SECTOR_SIZE="512" LOGICAL_SECTOR_SIZE="512"
> -----------------------------------------------------------------------
> I don't know why blkid is reporting such a large OPTIMAL_IO_SIZE. I would
> expect this to be 1024K (due to two data disks in a three-disk RAID-5).

Yup, it's broken. :/

> Translating into 512-byte sectors, I think the topology should be:
> chunk size (sunit): 1024 sectors
> stripe size (swidth): 2048 sectors

Yup, or as it reports from mkfs, sunit=128 fsbs, swidth=256 fsbs.

> -----------------------------------------------------------------------
> $ sudo blktrace -d /dev/md10 -o - | blkparse -i - | grep ' Q '
>   9,10   15        1     0.000000000 186548  Q  WS 3829760 + 2048 [dd]
>   9,10   15        3     0.021087119 186548  Q  WS 3831808 + 2048 [dd]
>   9,10   15        5     0.023605705 186548  Q  WS 3833856 + 2048 [dd]
>   9,10   15        7     0.026093572 186548  Q  WS 3835904 + 2048 [dd]
>   9,10   15        9     0.028595887 186548  Q  WS 3837952 + 2048 [dd]
>   9,10   15       11     0.031171221 186548  Q  WS 3840000 + 2048 [dd]
> [...]
>   9,10    5      441    14.601942400 186608  Q  WS 8082432 + 2048 [dd]
>   9,10    5      443    14.620316654 186608  Q  WS 8084480 + 2048 [dd]
>   9,10    5      445    14.646707430 186608  Q  WS 8086528 + 2048 [dd]
>   9,10    5      447    14.654519976 186608  Q  WS 8088576 + 2048 [dd]
>   9,10    5      449    14.680901605 186608  Q  WS 8090624 + 2048 [dd]
>   9,10    5      451    14.689156421 186608  Q  WS 8092672 + 2048 [dd]
>   9,10    5      453    14.706529362 186608  Q  WS 8094720 + 2048 [dd]
>   9,10    5      455    14.732451407 186608  Q  WS 8096768 + 2048 [dd]
> -----------------------------------------------------------------------
> In the beginning, writes queued are stripe-aligned. For example:
> 3829760 / 2048 == 1870
>
> Later on, writes end up getting misaligned by half a stripe. For example:
> 8082432 / 2048 == 3946.5

So it's aligned to sunit, not swidth. That will match up with a
discontiguity in the file layout, i.e. an extent boundary. And given this
is at just under 4GB written, and the AG size is just under 2GB, this
discontiguity is going to occur as writing fills AG 1 and allocation
switches to AG 2.

> I tried manually specifying '-d sunit=1024,swidth=2048' for mkfs.xfs, but
> that had pretty much the same behavior when writing (the RMW starts later,
> but it still starts).

It won't change anything, actually. The first allocation in an AG will
determine which stripe unit the new extent starts on, and then for the
entire AG the writes will be aligned to that choice.

If you do IOs much larger than the stripe width (e.g. 16MB at a time), the
impact of the head/tail RMW will largely go away. The problem is that you
are doing exactly stripe-width-sized IOs, and so this is the worst case
for any allocation misalignment that might occur.

> Am I doing something wrong, or is there a bug, or are my expectations
> incorrect? I had expected that large sequential writes would be aligned with
> swidth.

Expectations are wrong. Large allocations are aligned to the stripe unit
in XFS by default. This is because XFS was tuned for *large* multi-layer
RAID setups like RAID-50 that had hardware RAID-5 luns striped together
via RAID-0 in the volume manager. In these setups, the stripe unit is the
hardware RAID-5 lun stripe width (the minimum size that avoids RMW) and
the stripe width is the RAID-0 width. Hence for performance, it didn't
matter which stripe unit an allocation aligned to, as long as writes
spanned the entire stripe width. That way they would hit every lun.

In general, we don't want stripe width aligned allocation, because that
hot-spots the first stripe unit in the stripe as all file data first
writes to that unit. A RAID stripe is only as fast as its slowest disk,
and so having a hot stripe unit slows everything down. Hence by default
we move the initial allocation around the stripe units, and that largely
removes the hotspots in the RAID luns...

So, yeah, there are good reasons for stripe unit aligned allocation
rather than stripe width aligned. The problem is that MD has never
behaved this way - it has always exposed its individual disk chunk size
as the minimum IO size (i.e. the stripe unit) and the stripe width as the
optimal IO size to avoid RMW cycles.

If you want to force XFS to do stripe width aligned allocation for large
files to match how MD exposes its topology to filesystems, use the
'swalloc' mount option. The downside is that you'll hotspot the first
disk in the MD array....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
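
A minimal sketch of the two workarounds described above, assuming the
/dev/md10 array from the reproduction; the /mnt/scratch mount point and
test file path are placeholders, not taken from the thread:
-----------------------------------------------------------------------
# Force stripe width aligned allocation with the 'swalloc' mount option
# (trades the RMW problem for a hotspot on the first chunk of each
# stripe, as noted above):
$ sudo mount -o swalloc /dev/md10 /mnt/scratch

# Or keep the default sunit-aligned allocation and issue direct I/O in
# chunks much larger than the stripe width, e.g. 16MB at a time:
$ sudo dd if=/dev/zero of=/mnt/scratch/testfile bs=16M count=256 oflag=direct
-----------------------------------------------------------------------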