On Tue, May 21, 2024 at 1:43 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> Swap and file systems have very different requirements and usage
> patterns and IO patterns.

I would counter that the design requirements for a simple filesystem
and what you are proposing to support heterogeneously sized block
allocation on a block device are very similar, not very different.

The data is owned by clients, but I've done the profiling on servers
and Android. As I've stated before, databases have reasonably close
usage and IO patterns. Swap usage of block devices is not a
particularly odd usage profile.

> One challenging aspect is that the current swap back end has a very
> low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> byte (swap cgroup), 8 byte (swap cache pointer). The inode struct is
> more than 64 bytes per file. That is a big jump if you map a swap
> entry to a file. If you map more than one swap entry to a file, then
> you need to track the mapping of file offset to swap entry, and the
> reverse lookup of swap entry to a file with offset. Whichever way you
> cut it, it will significantly increase the per swap entry memory
> overhead.

No, it won't, because the suggestion is NOT to add an array of inode
structs in place of the structures you've been talking about altering.

IIUC, your proposals per the "Swap Abstraction LSF_MM 2024.pdf" are to
more than double the per-entry overhead, from 11 B to 24 B. Is that
correct? Of course, if modernizing the structures to be properly folio
aware requires a few bytes, that seems prudent.

Also IIUC, 8 of those 24 bytes are a per-swap-entry pointer to a
dynamically allocated structure that will be used to manage
heterogeneous block size allocation on block devices. I object to
this. That's what the filesystem abstraction is for. EXT4 too heavy
for you? Then make a simpler filesystem.

So how do you map swap entries to a filesystem without a new mapping
layer? Here is a simple proposal. (It assumes there are only 16 valid
folio orders. There are ways to get around that limit, but they would
take longer to explain, so let's just go with it.)

* swap_types (fs inodes) map to different page sizes (page, compound
  order, folio order, mTHP size, etc.), e.g. swap_type == 1 -> 4K
  pages, swap_type == 15 -> 1G hugepages, etc.
* swap_type = fs inode
* swap_offset = fs file offset
* swap_offset is selected using the same simple allocation scheme as
  today.
  - Because the swap entries are all the same size/order per
    swap_type/inode, you can just pick the first free slot.
* On freeing a swap entry, call fallocate(FALLOC_FL_PUNCH_HOLE).
  - This removes the blocks from the file without changing its "size".
  - No changes to the swap_offsets are required to garbage collect
    blocks.

(Sketches of the entry encoding and of the free path follow at the end
of this mail.)

This gets you the following:

* dynamic allocation of block space between sizes/orders
* no new in-memory tracking structures covering all swap entries
* the burden of tracking is placed on the filesystem
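To make the mapping concrete, here is a minimal sketch in plain C.
Every name in it (swap_backing_file, swap_alloc_slot, the bit layout
of the entry) is made up for illustration; nothing here is an existing
kernel symbol. One backing file per folio order; allocation is a
first-free-slot scan over a bitmap, which is sufficient precisely
because all entries in a given file are the same order:

#include <stdint.h>
#include <stdio.h>

#define NR_SWAP_ORDERS 16      /* assumption: 16 valid folio orders */
#define SLOTS_PER_TYPE 4096    /* toy capacity per backing file */

struct swap_backing_file {     /* stands in for one fs inode */
	int order;                         /* folio order stored here */
	uint8_t used[SLOTS_PER_TYPE / 8];  /* slot allocation bitmap */
};

static struct swap_backing_file swap_files[NR_SWAP_ORDERS];

/* Pick the first free slot in the file for this folio order. */
static long swap_alloc_slot(int order)
{
	struct swap_backing_file *sf = &swap_files[order];

	for (long slot = 0; slot < SLOTS_PER_TYPE; slot++) {
		if (!(sf->used[slot / 8] & (1u << (slot % 8)))) {
			sf->used[slot / 8] |= 1u << (slot % 8);
			return slot;       /* becomes swap_offset */
		}
	}
	return -1;                         /* file full */
}

/* swap entry = (swap_type, swap_offset) = (inode, file offset) */
static uint64_t make_swap_entry(int type, long offset)
{
	return ((uint64_t)type << 58) | (uint64_t)offset;
}

int main(void)
{
	long off = swap_alloc_slot(0);     /* order-0 (4K) file */

	printf("entry = %#llx\n",
	       (unsigned long long)make_swap_entry(0, off));
	return 0;
}

Note there is no per-entry pointer anywhere: the entry itself is just
(inode, offset), and everything else lives in the filesystem.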
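And the free path really is just hole punching. Here is a userspace
demonstration of the syscall (the kernel-side equivalent would go
through vfs_fallocate(); the file name and slot geometry are made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("swapfile.img", O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Slot geometry for a hypothetical order-0 (4K) swap file. */
	off_t slot_size = 4096, slot = 42;

	/*
	 * "Free" slot 42. FALLOC_FL_PUNCH_HOLE must be OR'ed with
	 * FALLOC_FL_KEEP_SIZE, which is exactly the property we want:
	 * the blocks go back to the filesystem while the file's size,
	 * and therefore every other swap_offset, stays untouched.
	 */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      slot * slot_size, slot_size) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}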