Hi,

High-capacity SSDs require writes to be aligned with the drive's
indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
support swap on these devices, we need to ensure that writes do not
cross IU boundaries, so I think this may require increasing the
minimum allocation size for swap users. As a temporary alternative, a
proposal [1] to prevent swap on these devices was previously sent for
discussion before LBS was merged in v6.12 [2]. Additional details and
reasoning can be found in the discussion of [1].

[1] https://lore.kernel.org/all/20240627000924.2074949-1-mcgrof@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20240913-vfs-blocksize-ab40822b2366@brauner/

So, I'd like to bring this up for discussion here and/or propose it as
a topic for the next MM bi-weekly meeting if needed. Please let me
know if this has already been discussed previously.

Given that we already support large folios with mTHP for anon memory
and shmem, a similar approach where we avoid falling back to smaller
allocations might suffice, as is done in the page cache with the
minimum folio order.

Monitoring writes on a dedicated NVMe device used for swap with the
blkalgn tool [3], I get the following results:

[3] https://github.com/iovisor/bcc/pull/5128

Swap setup:

mkdir -p /mnt/swap
sudo mkfs.xfs -b size=16k /dev/nvme0n1 -f
sudo mount --types xfs /dev/nvme0n1 /mnt/swap
sudo fallocate -l 8192M /mnt/swap/swapfile
sudo chmod 600 /mnt/swap/swapfile
sudo mkswap /mnt/swap/swapfile
sudo swapon /mnt/swap/swapfile
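A quick sanity check of the setup (assuming the mount point and device
above), just to confirm the 16k filesystem block size and that the
swapfile is active before running the test:

# expect bsize=16384 in the data section
xfs_info /mnt/swap | grep bsize

# confirm the swapfile is registered
swapon --show
cat /proc/swaps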
Swap stress test (guest with 7.8Gi of RAM):

stress --vm-bytes 7859M --vm-keep -m 1 --timeout 300

Results:

1. Vanilla v6.12, no mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 3255     |****************************************|
      8192 -> 16383      : 783      |*********                               |
     16384 -> 32767      : 255      |***                                     |
     32768 -> 65535      : 61       |                                        |
     65536 -> 131071     : 24       |                                        |
    131072 -> 262143     : 22       |                                        |
    262144 -> 524287     : 2136     |**************************              |

The above shows the alignment of writes in power-of-2 steps for the
swap-dedicated nvme0n1 device. The corresponding granularity of these
writes is shown in the linear histogram below, where the unit is a
512-byte sector (e.g. a value of 8 means 8 << 9 = 4096 bytes). So the
first bucket indicates that 821 writes were sent with a size of 4 KiB,
and the last one shows that 2441 writes were sent with a size of
512 KiB.

I/O Granularity Histogram for Device nvme0n1
Total I/Os: 6536
   sector : count     distribution
        8 : 821      |*************                           |
       16 : 131      |**                                      |
       24 : 339      |*****                                   |
       32 : 259      |****                                    |
       40 : 114      |*                                       |
       48 : 162      |**                                      |
       56 : 249      |****                                    |
       64 : 257      |****                                    |
       72 : 157      |**                                      |
       80 : 90       |*                                       |
       88 : 109      |*                                       |
       96 : 188      |***                                     |
      104 : 228      |***                                     |
      112 : 262      |****                                    |
      120 : 81       |*                                       |
      128 : 44       |                                        |
      136 : 22       |                                        |
      144 : 20       |                                        |
      152 : 20       |                                        |
      160 : 18       |                                        |
      168 : 43       |                                        |
      176 : 9        |                                        |
      184 : 5        |                                        |
      192 : 2        |                                        |
      200 : 3        |                                        |
      208 : 2        |                                        |
      216 : 4        |                                        |
      224 : 6        |                                        |
      232 : 4        |                                        |
      240 : 2        |                                        |
      248 : 11       |                                        |
      256 : 9        |                                        |
      264 : 17       |                                        |
      272 : 19       |                                        |
      280 : 16       |                                        |
      288 : 7        |                                        |
      296 : 5        |                                        |
      304 : 2        |                                        |
      312 : 7        |                                        |
      320 : 5        |                                        |
      328 : 4        |                                        |
      336 : 23       |                                        |
      344 : 2        |                                        |
      352 : 12       |                                        |
      360 : 5        |                                        |
      368 : 5        |                                        |
      376 : 1        |                                        |
      384 : 3        |                                        |
      392 : 3        |                                        |
      400 : 2        |                                        |
      408 : 1        |                                        |
      416 : 1        |                                        |
      424 : 6        |                                        |
      432 : 5        |                                        |
      440 : 3        |                                        |
      448 : 7        |                                        |
      456 : 2        |                                        |
      472 : 2        |                                        |
      480 : 2        |                                        |
      488 : 7        |                                        |
      496 : 5        |                                        |
      504 : 11       |                                        |
      520 : 3        |                                        |
      528 : 1        |                                        |
      536 : 2        |                                        |
      544 : 5        |                                        |
      560 : 1        |                                        |
      568 : 2        |                                        |
      576 : 1        |                                        |
      584 : 2        |                                        |
      592 : 2        |                                        |
      600 : 2        |                                        |
      608 : 1        |                                        |
      616 : 2        |                                        |
      624 : 5        |                                        |
      632 : 1        |                                        |
      640 : 1        |                                        |
      648 : 1        |                                        |
      656 : 5        |                                        |
      664 : 8        |                                        |
      672 : 20       |                                        |
      680 : 3        |                                        |
      688 : 1        |                                        |
      704 : 1        |                                        |
      712 : 1        |                                        |
      720 : 3        |                                        |
      728 : 4        |                                        |
      736 : 6        |                                        |
      744 : 14       |                                        |
      752 : 14       |                                        |
      760 : 12       |                                        |
      768 : 3        |                                        |
      776 : 5        |                                        |
      784 : 2        |                                        |
      792 : 2        |                                        |
      800 : 1        |                                        |
      808 : 3        |                                        |
      816 : 1        |                                        |
      824 : 5        |                                        |
      832 : 2        |                                        |
      840 : 15       |                                        |
      848 : 9        |                                        |
      856 : 2        |                                        |
      864 : 1        |                                        |
      872 : 2        |                                        |
      880 : 10       |                                        |
      888 : 4        |                                        |
      896 : 5        |                                        |
      904 : 1        |                                        |
      920 : 2        |                                        |
      936 : 3        |                                        |
      944 : 1        |                                        |
      952 : 6        |                                        |
      960 : 1        |                                        |
      968 : 1        |                                        |
      976 : 1        |                                        |
      984 : 1        |                                        |
      992 : 2        |                                        |
     1000 : 2        |                                        |
     1008 : 16       |                                        |
     1016 : 1        |                                        |
     1024 : 2441     |****************************************|

2. Vanilla v6.12 with all mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 5076     |****************************************|
      8192 -> 16383      : 907      |*******                                 |
     16384 -> 32767      : 302      |**                                      |
     32768 -> 65535      : 141      |*                                       |
     65536 -> 131071     : 46       |                                        |
    131072 -> 262143     : 35       |                                        |
    262144 -> 524287     : 1993     |***************                         |
    524288 -> 1048575    : 6        |                                        |
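As a side note, to see whether large folios are actually being swapped
out whole or split first, the swap-out counters can be sampled during
the run; a minimal check, assuming a kernel that exposes the per-size
mTHP swpout counters (v6.10+):

# PMD-sized THP swap-out vs. fallback (split) counts
grep thp_swpout /proc/vmstat

# per-size mTHP swap-out vs. fallback counts
for d in /sys/kernel/mm/transparent_hugepage/hugepages-*; do
        echo "$(basename $d): swpout=$(cat $d/stats/swpout)" \
             "swpout_fallback=$(cat $d/stats/swpout_fallback)"
done

My expectation would be that the fallback counters dominate on the
vanilla swapfile setup, while the swpout counters take over with the
SWP_BLKDEV change below.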
In addition, I've tested and monitored writes with SWP_BLKDEV enabled
for regular files, to allow large folios for swap files on block
devices, and checked the difference:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..80a9dbe9645a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3128,6 +3128,7 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
 		si->flags |= SWP_BLKDEV;
 	} else if (S_ISREG(inode->i_mode)) {
 		si->bdev = inode->i_sb->s_bdev;
+		si->flags |= SWP_BLKDEV;
 	}

 	return 0;

With the following alignment results:

3. v6.12 + SWP_BLKDEV change with mTHP disabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 146      |*****                                   |
      8192 -> 16383      : 23       |                                        |
     16384 -> 32767      : 10       |                                        |
     32768 -> 65535      : 1        |                                        |
     65536 -> 131071     : 3        |                                        |
    131072 -> 262143     : 0        |                                        |
    262144 -> 524287     : 1020     |****************************************|

4. v6.12 + SWP_BLKDEV change with mTHP enabled:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 240      |******                                  |
      8192 -> 16383      : 34       |                                        |
     16384 -> 32767      : 4        |                                        |
     32768 -> 65535      : 0        |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 1542     |****************************************|

2nd run:

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 356      |************                            |
      8192 -> 16383      : 74       |**                                      |
     16384 -> 32767      : 58       |**                                      |
     32768 -> 65535      : 54       |*                                       |
     65536 -> 131071     : 37       |*                                       |
    131072 -> 262143     : 11       |                                        |
    262144 -> 524287     : 1104     |****************************************|
    524288 -> 1048575    : 1        |                                        |

For comparison, the histogram below represents a stress test with
random-size writes on a drive with LBS enabled (XFS with 16k block
size):

I/O Alignment Histogram for Device nvme0n1
     bytes               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 1758     |*                                       |
      1024 -> 2047       : 476      |                                        |
      2048 -> 4095       : 164      |                                        |
      4096 -> 8191       : 42       |                                        |
      8192 -> 16383      : 10       |                                        |
     16384 -> 32767      : 3629     |***                                     |
     32768 -> 65535      : 47861    |****************************************|
     65536 -> 131071     : 25702    |*********************                   |
    131072 -> 262143     : 10791    |*********                               |
    262144 -> 524287     : 11094    |*********                               |
    524288 -> 1048575    : 55       |                                        |

The test drive here uses a 512-byte LBA format, so writes can start at
that boundary. However, LBS/min order allows most of the writes to
fall at 16k boundaries or greater.

What do you think?

Daniel