On Tue, May 21, 2024 at 01:40:56PM -0700, Chris Li wrote:
> > Filesystems already implemented a lot of solutions for fragmentation
> > avoidance that are more appropriate for slow storage media.
>
> Swap and file systems have very different requirements and usage
> patterns and IO patterns.

Should they, though?  Filesystems noticed that handling pages in LRU
order was inefficient, so they stopped doing it (see the removal of
aops->writepage in favour of ->writepages, along with where each is
called from).  Maybe it's time for swap to start doing writes in the
order of virtual addresses within a VMA, instead of in LRU order.

Indeed, if we're open to radical ideas, the LRU sucks.  A physical scan
is 40x faster:

https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@xxxxxxxxxxxxxxxxxxxx/

> One challenging aspect is that the current swap back end has a very
> low per swap entry memory overhead.  It is about 1 byte (swap_map),
> 2 bytes (swap cgroup) and 8 bytes (swap cache pointer).  The inode
> struct is more than 64 bytes per file.  That is a big jump if you map
> a swap entry to a file.  If you map more than one swap entry to a
> file, then you need to track the mapping of file offset to swap
> entry, and the reverse lookup of swap entry to a file with offset.
> Whichever way you cut it, it will significantly increase the per swap
> entry memory overhead.

Not necessarily, no.  If your workload uses a lot of order-2, order-4
and order-9 folios, then the current scheme is using 11 bytes per page,
so 44 bytes per order-2 folio, 176 per order-4 folio and 5632 per
order-9 folio.  That's a lot of bytes we can use for an extent-based
scheme.

Also, why would you compare the size of an inode to the size of a swap
entry?  An inode is ~equivalent to an anon_vma, not to a swap entry.
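
To make that arithmetic concrete, here is a trivial user-space sketch
of the per-page accounting (the 1/2/8-byte figures come from Chris's
breakdown above; the macro and function names are made up purely for
illustration and are not kernel identifiers):

/*
 * Per-page swap metadata today: 1 byte of swap_map, 2 bytes of swap
 * cgroup id, 8 bytes for the swap cache pointer = 11 bytes per page.
 * A folio of a given order pays that for each of its 1 << order pages.
 */
#include <stdio.h>

#define SWAP_MAP_BYTES		1	/* swap_map entry */
#define SWAP_CGROUP_BYTES	2	/* swap cgroup id */
#define SWAP_CACHE_PTR_BYTES	8	/* swap cache slot */

static unsigned long per_folio_overhead(unsigned int order)
{
	unsigned long pages = 1UL << order;

	return pages * (SWAP_MAP_BYTES + SWAP_CGROUP_BYTES +
			SWAP_CACHE_PTR_BYTES);
}

int main(void)
{
	unsigned int orders[] = { 0, 2, 4, 9 };
	unsigned int i;

	for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++)
		printf("order-%u folio: %lu bytes of swap metadata\n",
		       orders[i], per_folio_overhead(orders[i]));

	return 0;	/* prints 11, 44, 176 and 5632 respectively */
}

That 44/176/5632-byte budget per folio is the headroom an extent-based
scheme would have to play with before it costs more than what we do now.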