Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

On Tue, May 28, 2024 at 12:08 AM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
>
> On Tue, May 21, 2024 at 1:43 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > Swap and file systems have very different requirements and usage
> > patterns and IO patterns.
>
> I would counter that the design requirements for a simple filesystem
> and what you are proposing doing to support heterogeneously sized
> block allocation on a block device are very similar, not very
> different.
>
> Data is owned by clients, but I've done the profiling on servers and
> Android.  As I've stated before, databases have reasonably close usage
> and IO.  Swap usage of block devices is not a particularly odd usage
> profile.
>
> > One challenging aspect is that the current swap back end has a very
> > low per swap entry memory overhead. It is about 1 byte (swap_map), 2
> > byte (swap cgroup), 8 byte(swap cache pointer). The inode struct is
> > more than 64 bytes per file. That is a big jump if you map a swap
> > entry to a file. If you map more than one swap entry to a file, then
> > you need to track the mapping of file offset to swap entry, and the
> > reverse lookup of swap entry to a file with offset. Whichever way you
> > cut it, it will significantly increase the per swap entry memory
> > overhead.
>
> No it won't.  Because the suggestion is NOT to add some array of inode
> structs in place of the structures you've been talking about altering.
>
> IIUC your proposals per the "Swap Abstraction LSF_MM 2024.pdf" are to
> more than double the per entry overhead from 11 B to 24 B.  Is that
> correct? Of course if modernizing the structures to be properly folio
> aware requires a few bytes, that seems prudent.

The most expanded form of the swap entry is 24B, the last option in the slides.
However, you get savings from sharing compound swap entries,
e.g. for a PMD-size compound swap entry you can have 512 identical
swap entries within one compound swap entry. Each of them only needs an
8-byte pointer to the compound entry struct, so the average per entry is
8B + (24B + compound struct overhead)/512, much smaller than 24B.
If all swap entries are order 0, then yes, the average is 24B per entry.
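
To make the averaging concrete, a quick back-of-envelope sketch (the 64B of
per-compound bookkeeping overhead is just an assumed number for illustration,
not something from the slides):

#include <stdio.h>

int main(void)
{
	double entry_ptr = 8;	/* per-entry pointer to the compound struct */
	double compound  = 24;	/* most expanded swap entry form */
	double extra     = 64;	/* assumed per-compound struct overhead */
	double nr        = 512;	/* order-0 entries under one PMD-size entry */

	/* 512 entries share one compound entry */
	printf("PMD-size average: %.2f B/entry\n",
	       entry_ptr + (compound + extra) / nr);
	/* all order 0: every entry pays the full 24B */
	printf("order-0 average: %.2f B/entry\n", compound);
	return 0;
}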

>
> Also IIUC 8 bytes of the 24 are a per swap entry pointer to a
> dynamically allocated structure that will be used to manage
> heterogeneous block size allocation management on block devices.  I
> object to this.  That's what the filesystem abstraction is for.  EXT4
> too heavy for you? Then make a simpler filesystem.

You can call my compound swap entry a simpler filesystem; that is just a
different name for it. If you are writing a new file system for swap, you
don't need the inode and most of the VFS ops etc.
Those are unnecessary complexity to deal with.

>
> So how do you map swap entries to a filesystem without a new mapping
> layer?  Here is a simple proposal.  (It assumes there are only 16
> valid folio orders.  There are ways to get around that limit but it
> would take longer to explain, so let's just go with it.)
>
> * swap_types (fs inodes) map to different page sizes (page, compound
> order, folio order, mTHP size etc).

Swap type has a preexisting meaning in the Linux swap back end code: it
refers to the swap device. Let me call it "swap_order" instead.

>    ex. swap_type == 1 -> 4K pages,    swap_type == 15 -> 1G hugepages etc
> * swap_type = fs inode
> * swap_offset = fs file offset
> * swap_offset is selected using the same simple allocation scheme as today.
>   - because the swap entries are all the same size/order per
> swap_type/inode you can just pick the first free slot.
> * on freeing a swap entry call fallocate(FALLOC_FL_PUNCH_HOLE)
>   - removes the blocks from the file without changing its "size".
>   - no changes are required to the swap_offsets to garbage collect blocks.
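
A minimal sketch of that last free step as I read it, with made-up fd/slot
plumbing (one backing file per order, swap slots mapping 1:1 to fixed-size
chunks of that file):

#define _GNU_SOURCE
#include <fcntl.h>

#define PAGE_SHIFT 12	/* assume 4K base pages for the example */

static int free_swap_slot(int order_fd, int order, unsigned long offset)
{
	off_t len = (off_t)1 << (PAGE_SHIFT + order);	/* slot size for this order */
	off_t pos = (off_t)offset * len;		/* slot position in the file */

	/* PUNCH_HOLE must be paired with KEEP_SIZE: the file length stays
	 * the same, only the backing blocks are released, so the other
	 * swap_offsets never have to move */
	return fallocate(order_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 pos, len);
}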

Can I assume your swap entry encoding is something like [swap_order
(your swap_type)] + [swap_offset]?

Let's set aside the fact that you might not be able to fit the swap order
bits into the swap entry on a 32-bit system.
Assume the swapfile is small enough that this is not a problem.
Now your swap cache address space is 16x the original swap cache
address space.
You may say, oh, that is a "virtual" swap cache address space, and you are
not using the full 16x address space at the same time.
That is true. However, you can create much worse fragmentation in your 16x
virtual swap cache address space. The xarray used to track the swap
cache does not handle sparse index storage well; its worst-case
fragmentation is about 32-64x. So the worst fragmentation in your 16x
swap address space can be something close to the full 16x.
Let's not even take 16x; pick a low-end 4x. At 4x the 8B per swap cache
pointer, that is already 32B per swap entry on the swap cache alone.
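
To put numbers on that, a small sketch (the 1M-slot device, the 16 orders
and the low-end 4x fragmentation factor are all just the illustrative
figures above):

#include <stdio.h>

int main(void)
{
	unsigned long dev_slots = 1UL << 20;	/* 4K-size slots on the device */
	unsigned long orders    = 16;		/* one offset space per order */
	double xa_slot = 8;			/* bytes per swap cache pointer */
	double frag    = 4;			/* assumed wasted-slot factor */

	/* each order's offset space can span the whole device, so the
	 * cache index space is 16x the device even though only dev_slots
	 * entries can be in use at any time */
	printf("cache index space: %lu slots (vs %lu today)\n",
	       dev_slots * orders, dev_slots);

	/* sparse indices waste xarray slots: at 4x that is 32B of swap
	 * cache per present swap entry */
	printf("swap cache cost: %.0f B per entry\n", xa_slot * frag);
	return 0;
}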

FYI, neither the original swap cache nor the compound swap entry in the
PDF has this swap cache address space blow-up issue.

Chris

>
> This allows you the following:
> * dynamic allocation of block space between sizes/orders
> * avoids any new tracking structures in memory for all swap entries
> * places burden of tracking on filesystem
>




