On Tue, May 21, 2024 at 01:40:56PM -0700, Chris Li wrote:
> > Filesystems already implemented a lot of solutions for fragmentation
> > avoidance that are more appropriate for slow storage media.
>
> Swap and file systems have very different requirements and usage
> patterns and IO patterns.

Should they, though?  Filesystems noticed that handling pages in LRU
order was inefficient, so they stopped doing it (see the removal of
aops->writepage in favour of ->writepages, along with where each is
called from).  Maybe it's time for swap to start doing writes in the
order of virtual addresses within a VMA, instead of in LRU order.

Indeed, if we're open to radical ideas, the LRU sucks.  A physical scan
is 40x faster:

https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@xxxxxxxxxxxxxxxxxxxx/

> One challenging aspect is that the current swap back end has a very
> low per swap entry memory overhead.  It is about 1 byte (swap_map),
> 2 bytes (swap cgroup) and 8 bytes (swap cache pointer).  The inode
> struct is more than 64 bytes per file.  That is a big jump if you map
> a swap entry to a file.  If you map more than one swap entry to a
> file, then you need to track the mapping of file offset to swap
> entry, and the reverse lookup of swap entry to a file with offset.
> Whichever way you cut it, it will significantly increase the per swap
> entry memory overhead.

Not necessarily, no.  If your workload uses a lot of order-2, order-4
and order-9 folios, then the current scheme is using 11 bytes per page,
so 44 bytes per order-2 folio, 176 per order-4 folio and 5632 per
order-9 folio.  That's a lot of bytes we can use for an extent-based
scheme.

Also, why would you compare the size of an inode to the size of a swap
entry?  An inode is ~equivalent to an anon_vma, not to a swap entry.
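
To make that arithmetic concrete, here is a trivial user-space sketch
of the per-page accounting (the 1/2/8-byte figures come from Chris's
breakdown above; the macro and function names are made up purely for
illustration and are not kernel identifiers):

/*
 * Per-page swap metadata today: 1 byte of swap_map, 2 bytes of swap
 * cgroup id, 8 bytes for the swap cache pointer = 11 bytes per page.
 * A folio of a given order pays that for each of its 1 << order pages.
 */
#include <stdio.h>

#define SWAP_MAP_BYTES		1	/* swap_map entry */
#define SWAP_CGROUP_BYTES	2	/* swap cgroup id */
#define SWAP_CACHE_PTR_BYTES	8	/* swap cache slot */

static unsigned long per_folio_overhead(unsigned int order)
{
	unsigned long pages = 1UL << order;

	return pages * (SWAP_MAP_BYTES + SWAP_CGROUP_BYTES +
			SWAP_CACHE_PTR_BYTES);
}

int main(void)
{
	unsigned int orders[] = { 0, 2, 4, 9 };
	unsigned int i;

	for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++)
		printf("order-%u folio: %lu bytes of swap metadata\n",
		       orders[i], per_folio_overhead(orders[i]));

	return 0;	/* prints 11, 44, 176 and 5632 respectively */
}

That 44/176/5632-byte budget per folio is the headroom an extent-based
scheme would have to play with before it costs more than what we do now.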