Hi,

High-capacity SSDs require writes to be aligned with the drive's
indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
support swap on these devices, we need to ensure that writes do not
cross IU boundaries, so I think this may require increasing the
minimum allocation size for swap users. As a temporary alternative, a
proposal [1] to prevent swap on these devices was previously sent for
discussion before LBS was merged in v6.12 [2]. Additional details and
reasoning can be found in the discussion of [1].

[1] https://lore.kernel.org/all/20240627000924.2074949-1-mcgrof@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20240913-vfs-blocksize-ab40822b2366@brauner/

So, I'd like to bring this up for discussion here and/or propose it as
a topic for the next MM bi-weekly meeting if needed. Please let me
know if this has already been discussed previously.

Given that we already support large folios with mTHP for anon memory
and shmem, a similar approach where we avoid falling back to smaller
allocations might suffice, as is done in the page cache with the
minimum folio order.

Monitoring writes on a dedicated NVMe device used for swap with the
blkalgn tool [3], I get the following results:

[3] https://github.com/iovisor/bcc/pull/5128

Swap setup:

mkdir -p /mnt/swap
sudo mkfs.xfs -b size=16k /dev/nvme0n1 -f
sudo mount --types xfs /dev/nvme0n1 /mnt/swap
sudo fallocate -l 8192M /mnt/swap/swapfile
sudo chmod 600 /mnt/swap/swapfile
sudo mkswap /mnt/swap/swapfile
sudo swapon /mnt/swap/swapfile
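A quick sanity check of the setup (assuming the mount point and device
above), just to confirm the 16k filesystem block size and that the
swapfile is active before running the test:

# expect bsize=16384 in the data section
xfs_info /mnt/swap | grep bsize

# confirm the swapfile is registered
swapon --show
cat /proc/swaps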
Swap stress test (guest with 7.8Gi of RAM):

stress --vm-bytes 7859M --vm-keep -m 1 --timeout 300

Results:

1. Vanilla v6.12, no mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 3255     |****************************************|
      8192 -> 16383      : 783      |*********                               |
     16384 -> 32767      : 255      |***                                     |
     32768 -> 65535      : 61       |                                        |
     65536 -> 131071     : 24       |                                        |
    131072 -> 262143     : 22       |                                        |
    262144 -> 524287     : 2136     |**************************              |

The above shows the alignment of writes in power-of-2 steps for the
swap-dedicated nvme0n1 device. The corresponding granularity of these
writes is shown in the linear histogram below, where the unit is a
512-byte sector (e.g. a value of 8 means 8 << 9 = 4096 bytes). So the
first bucket indicates that 821 writes were sent with a size of 4 KiB,
and the last one shows that 2441 writes were sent with a size of
512 KiB.

I/O Granularity Histogram for Device nvme0n1
Total I/Os: 6536
   sector : count     distribution
        8 : 821      |*************                           |
       16 : 131      |**                                      |
       24 : 339      |*****                                   |
       32 : 259      |****                                    |
       40 : 114      |*                                       |
       48 : 162      |**                                      |
       56 : 249      |****                                    |
       64 : 257      |****                                    |
       72 : 157      |**                                      |
       80 : 90       |*                                       |
       88 : 109      |*                                       |
       96 : 188      |***                                     |
      104 : 228      |***                                     |
      112 : 262      |****                                    |
      120 : 81       |*                                       |
      128 : 44       |                                        |
      136 : 22       |                                        |
      144 : 20       |                                        |
      152 : 20       |                                        |
      160 : 18       |                                        |
      168 : 43       |                                        |
      176 : 9        |                                        |
      184 : 5        |                                        |
      192 : 2        |                                        |
      200 : 3        |                                        |
      208 : 2        |                                        |
      216 : 4        |                                        |
      224 : 6        |                                        |
      232 : 4        |                                        |
      240 : 2        |                                        |
      248 : 11       |                                        |
      256 : 9        |                                        |
      264 : 17       |                                        |
      272 : 19       |                                        |
      280 : 16       |                                        |
      288 : 7        |                                        |
      296 : 5        |                                        |
      304 : 2        |                                        |
      312 : 7        |                                        |
      320 : 5        |                                        |
      328 : 4        |                                        |
      336 : 23       |                                        |
      344 : 2        |                                        |
      352 : 12       |                                        |
      360 : 5        |                                        |
      368 : 5        |                                        |
      376 : 1        |                                        |
      384 : 3        |                                        |
      392 : 3        |                                        |
      400 : 2        |                                        |
      408 : 1        |                                        |
      416 : 1        |                                        |
      424 : 6        |                                        |
      432 : 5        |                                        |
      440 : 3        |                                        |
      448 : 7        |                                        |
      456 : 2        |                                        |
      472 : 2        |                                        |
      480 : 2        |                                        |
      488 : 7        |                                        |
      496 : 5        |                                        |
      504 : 11       |                                        |
      520 : 3        |                                        |
      528 : 1        |                                        |
      536 : 2        |                                        |
      544 : 5        |                                        |
      560 : 1        |                                        |
      568 : 2        |                                        |
      576 : 1        |                                        |
      584 : 2        |                                        |
      592 : 2        |                                        |
      600 : 2        |                                        |
      608 : 1        |                                        |
      616 : 2        |                                        |
      624 : 5        |                                        |
      632 : 1        |                                        |
      640 : 1        |                                        |
      648 : 1        |                                        |
      656 : 5        |                                        |
      664 : 8        |                                        |
      672 : 20       |                                        |
      680 : 3        |                                        |
      688 : 1        |                                        |
      704 : 1        |                                        |
      712 : 1        |                                        |
      720 : 3        |                                        |
      728 : 4        |                                        |
      736 : 6        |                                        |
      744 : 14       |                                        |
      752 : 14       |                                        |
      760 : 12       |                                        |
      768 : 3        |                                        |
      776 : 5        |                                        |
      784 : 2        |                                        |
      792 : 2        |                                        |
      800 : 1        |                                        |
      808 : 3        |                                        |
      816 : 1        |                                        |
      824 : 5        |                                        |
      832 : 2        |                                        |
      840 : 15       |                                        |
      848 : 9        |                                        |
      856 : 2        |                                        |
      864 : 1        |                                        |
      872 : 2        |                                        |
      880 : 10       |                                        |
      888 : 4        |                                        |
      896 : 5        |                                        |
      904 : 1        |                                        |
      920 : 2        |                                        |
      936 : 3        |                                        |
      944 : 1        |                                        |
      952 : 6        |                                        |
      960 : 1        |                                        |
      968 : 1        |                                        |
      976 : 1        |                                        |
      984 : 1        |                                        |
      992 : 2        |                                        |
     1000 : 2        |                                        |
     1008 : 16       |                                        |
     1016 : 1        |                                        |
     1024 : 2441     |****************************************|

2. Vanilla v6.12 with all mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 5076     |****************************************|
      8192 -> 16383      : 907      |*******                                 |
     16384 -> 32767      : 302      |**                                      |
     32768 -> 65535      : 141      |*                                       |
     65536 -> 131071     : 46       |                                        |
    131072 -> 262143     : 35       |                                        |
    262144 -> 524287     : 1993     |***************                         |
    524288 -> 1048575    : 6        |                                        |
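As a side note, to see whether large folios are actually being swapped
out whole or split first, the swap-out counters can be sampled during
the run; a minimal check, assuming a kernel that exposes the per-size
mTHP swpout counters (v6.10+):

# PMD-sized THP swap-out vs. fallback (split) counts
grep thp_swpout /proc/vmstat

# per-size mTHP swap-out vs. fallback counts
for d in /sys/kernel/mm/transparent_hugepage/hugepages-*; do
        echo "$(basename $d): swpout=$(cat $d/stats/swpout)" \
             "swpout_fallback=$(cat $d/stats/swpout_fallback)"
done

My expectation would be that the fallback counters dominate on the
vanilla swapfile setup, while the swpout counters take over with the
SWP_BLKDEV change below.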
In addition, I've tested and monitored writes with SWP_BLKDEV enabled
for regular files, to allow large folios for swap files on block
devices, and checked the difference:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..80a9dbe9645a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3128,6 +3128,7 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
 		si->flags |= SWP_BLKDEV;
 	} else if (S_ISREG(inode->i_mode)) {
 		si->bdev = inode->i_sb->s_bdev;
+		si->flags |= SWP_BLKDEV;
 	}

 	return 0;

With the following alignment results:

3. v6.12 + SWP_BLKDEV change with mTHP disabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 146      |*****                                   |
      8192 -> 16383      : 23       |                                        |
     16384 -> 32767      : 10       |                                        |
     32768 -> 65535      : 1        |                                        |
     65536 -> 131071     : 3        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 1020     |****************************************|

4. v6.12 + SWP_BLKDEV change with mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 240      |******                                  |
      8192 -> 16383      : 34       |                                        |
     16384 -> 32767      : 4        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 1542     |****************************************|

2nd run:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 356      |************                            |
      8192 -> 16383      : 74       |**                                      |
     16384 -> 32767      : 58       |**                                      |
     32768 -> 65535      : 54       |*                                       |
     65536 -> 131071     : 37       |*                                       |
    131072 -> 262143     : 11       |                                        |
    262144 -> 524287     : 1104     |****************************************|
    524288 -> 1048575    : 1        |                                        |

For comparison, the histogram below represents a stress test with
random-size writes on a drive with LBS enabled (XFS with 16k block
size):

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 1758     |*                                       |
      1024 -> 2047       : 476      |                                        |
      2048 -> 4095       : 164      |                                        |
      4096 -> 8191       : 42       |                                        |
      8192 -> 16383      : 10       |                                        |
     16384 -> 32767      : 3629     |***                                     |
     32768 -> 65535      : 47861    |****************************************|
     65536 -> 131071     : 25702    |*********************                   |
    131072 -> 262143     : 10791    |*********                               |
    262144 -> 524287     : 11094    |*********                               |
    524288 -> 1048575    : 55       |                                        |

The test drive here uses a 512-byte LBA format, so writes can start at
that boundary. However, LBS/min order allows most of the writes to
fall at 16k boundaries or greater.

What do you think?

Daniel