Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Chris Li <chrisl@xxxxxxxxxx> · Mon, 4 Mar 2024 14:36:14 -0800

On Mon, Mar 4, 2024 at 10:44 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > Let’s start from the requirements for the swap back end.
> >
> > 1) support the existing swap usage (not the implementation).
> >
> > Some other design goals:
> >
> > 2) low per swap entry memory usage.
> >
> > 3) low io latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level, the swap system needs to support a list of swap
> > files in priority order. Swap devices with the same priority are
> > written round robin. The swap device types include zswap, zram, SSD,
> > spinning hard disk, and swap file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry usage:
> >
> > * Swap entry allocation and free. Each swap entry needs to be
> > associated with a location of the disk space in the swapfile. (offset
> > of swap entry).
> > * Each swap entry needs to track the map count of the entry. (swap_map)
> > * Each swap entry needs to be able to find the associated memory
> > cgroup. (swap_cgroup_ctrl->map)
> > * Swap cache. Lookup folio/shadow from swap entry
> > * Swap page writes through a swapfile in a file system other than a
> > block device. (swap_extent)
> > * Shadow entry. (store in swap cache)
> >
> > Any new swap back end might have different internal implementation,
> > but needs to support the above usage. For example, using the existing
> > file system as swap backend, per vma or per swap entry map to a file
> > would mean it needs additional data structure to track the
> > swap_cgroup_ctrl, combined with the size of the file inode. It would
> > be challenging to meet the design goal 2) and 3) using another file
> > system as it is.
> >
> > I am considering grouping different swap entry data into one single
> > struct and dynamically allocate it so no upfront allocation of
> > swap_map.
>
> Just some modest ideas about this ...

BTW, one trade-off I am interested in hearing more discussion about is
the current swap offset arrangement of the swap data structures.
There are a lot of array-like objects indexed by the swap offset
spread around different places in the swap code.
You can think of the swap offset as being like a struct page pfn.
If we want more of a memdesc type of indirection for the swap entry,
we would presumably get the swap entry struct pointer from the swap
cache or page cache.
We will need to add two pointer-sized fields per swap entry: one
pointing to the swap entry struct, and one in each swap entry struct
to remember the swap offset value (similar to the memdesc pfn).
So that is about 16 bytes per entry just for the indirection layer if
every swap entry is order 0.
If we have a lot of high order swap entries, we don't need to allocate
that many order 0 swap entries. We can have fewer high order swap
entries instead.
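
To put rough numbers on the 16 bytes per entry estimate, here is a
back-of-the-envelope sketch. It assumes 8-byte pointers, an 8-byte
stored swap offset, 4K base pages and a 64 GiB swap device; the
function name and constants are made up for illustration and are not
kernel code.

```python
POINTER_BYTES = 8   # assumption: 64-bit pointer to the swap entry struct
OFFSET_BYTES = 8    # assumption: swap offset remembered in the struct
PAGE_BYTES = 4096   # assumption: 4K base pages


def indirection_overhead(swap_bytes, order):
    """Bytes spent on the indirection layer if every entry has this order."""
    pages = swap_bytes // PAGE_BYTES
    entries = pages >> order  # one entry struct covers 2**order pages
    return entries * (POINTER_BYTES + OFFSET_BYTES)


GIB = 1 << 30
for order in (0, 4, 9):
    print(order, indirection_overhead(64 * GIB, order))
```

Under these assumptions the indirection layer costs 256 MiB when every
entry is order 0, 16 MiB at order 4 and 512 KiB at order 9.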

Another interesting thing I noticed is that when we have a lot of high
order swap entries, the current swap address space sharding of the
swap entries will work less effectively. All the high order swap
entries will likely be sharded to the same xarray tree due to offset %
64 == 0.

>
> Besides the usage, I noticed currently we already have following
> metadata reserved for SWAP:
> SWAP map (Array of char)
> SWAP shadow (XArray of pointer/long)
> SWAP cgroup map (Array of short)
> And ZSWAP has its own data.
> Also the folio->private (SWAP entry)
> PTE (SWAP entry)

One more thing to add is that shmem stores the swap entry not in the
PTE but in the shmem page cache. The PTE is none for shmem.

>
> Maybe something new can combine and make better use of these, also
> reduce redundancy. eg. SWAP shadow (assume it's not shrunk) contains
> cgroup info already; a folio in the swap cache has its ->private
> pointing to the SWAP entry while mapping/index are all empty; these
> may indicate some space for a smarter usage.

Keep in mind that after swap out, the folio will be removed from swap
cache. So the storage in folio->private will be gone.

>
> One easy approach might be making better use of the current swap cache
> xarray. We can never skip it even for direct swap in path (SYNC_IO),
> I'm working on it (not for a whole new swap abstraction, just trying
> to resolve some other issue and optimize things) and so far it seems
> OK. With some optimizations performance is even better than before, as
> we are already doing lookup and shadow cleaning in the current kernel.
>
> And considering XArray is capable of storing ranged data whose size
> is a power of 2, this gives us a nice tool to store grouped swap
> metadata for folios, and reduce memory overhead.

Yes, we can store the swap cache similar to how the file cache is
stored. The swap address space sharding might get in the way though.

>
> Following this idea we may be able to have a smoother progressive
> transition to a better design of SWAP (eg. start with storing more
> complex things other than folio/shadow, then make it more
> backend-specified, add features bit by bit), it is more unlikely to
> break things and we can test the stability and performance step by
> step.

You kind of need a struct for the swap entry, indexed by the swap cache.
If we don't use a contiguous swap offset, then we will likely have
that two pointer overhead per swap entry I mentioned above.
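
To make that a bit more concrete, here is a toy model of such a
struct. The field names, SwapEntryDesc and SwapIndex are all
hypothetical, and a plain dict stands in for the swap cache lookup
structure. It only illustrates the idea of grouping the per entry
metadata and allocating it on demand; it is not a proposal for the
actual layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SwapEntryDesc:
    """Hypothetical grouped per-entry metadata (illustrative only)."""
    offset: int                     # swap offset, kept for the back reference
    order: int = 0                  # entry covers 2**order slots
    map_count: int = 0              # what swap_map tracks today
    memcg_id: int = 0               # what the swap cgroup map tracks today
    cache: Optional[object] = None  # folio or shadow value


class SwapIndex:
    """Dict standing in for the swap cache lookup structure."""

    def __init__(self):
        self._by_offset = {}

    def insert(self, desc):
        # Every slot covered by the entry resolves to the same descriptor.
        for off in range(desc.offset, desc.offset + (1 << desc.order)):
            self._by_offset[off] = desc

    def lookup(self, offset):
        return self._by_offset.get(offset)

    def erase(self, desc):
        for off in range(desc.offset, desc.offset + (1 << desc.order)):
            self._by_offset.pop(off, None)
```

Descriptors exist only for entries that are in use, so nothing is
allocated upfront. The price is the lookup structure itself, which is
where the two pointer overhead comes from.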

> > For the swap entry allocation, the current kernel supports swapping
> > out order-0 or pmd order pages.
> >
> > There are some discussions and patches that add swap out for folio
> > size in between (mTHP)
> >
> > https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@xxxxxxx/
> >
> > and swap in for mTHP:
> >
> > https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@xxxxxxxxx/
> >
> > The introduction of swapping different order of pages will further
> > complicate the swap entry fragmentation issue. The swap back end has
> > no way to predict the life cycle of the swap entries. Repeatedly
> > allocating and freeing swap entries of different sizes will fragment
> > the swap entry array. If we can’t allocate contiguous swap entries
> > for a mTHP, it will have to split the mTHP to a smaller size to
> > perform the swap in and out.
> >
> > Current swap only supports 4K pages or pmd size pages. Adding the
> > other in between sizes greatly increases the chance of fragmenting
> > the swap entry space. When there are no more contiguous swap entries
> > for a mTHP, it will force the mTHP to split into 4K pages. If we
> > don’t solve the fragmentation issue, it will be a constant source of
> > mTHP splitting.
> >
> > Another limitation I would like to address is that swap_writepage can
> > only write out IO in one contiguous chunk; it is not able to perform
> > discontiguous IO. When the swapfile is close to full, it is likely
> > that the unused entries will be spread across different locations. It
> > would be nice to be able to read and write a large folio using
> > discontiguous disk IO locations.
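
As a rough illustration of what discontiguous IO for one large folio
could look like, the sketch below groups the swap offsets assigned to
a folio's subpages into contiguous runs, where each run would be one
IO. The helper name coalesce_extents is hypothetical and this is only
a model of the idea, not how swap_writepage works today.

```python
def coalesce_extents(offsets):
    """Group per-subpage swap offsets into (start, length) runs.

    The subpage order is preserved, so a run is extended only when the
    next offset directly follows the previous one.
    """
    extents = []
    for off in offsets:
        if extents and extents[-1][0] + extents[-1][1] == off:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((off, 1))
    return extents


# A folio whose subpages landed in three separate places on the device.
print(coalesce_extents([100, 101, 200, 201, 7]))
```

A fully contiguous allocation collapses to a single extent, which is
the only case the current write path handles. A fragmented one becomes
several smaller IOs instead of forcing a split of the folio.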
> >
> > Some possible ideas for the fragmentation issue.
> >
> > a) Buddy allocator for swap entries. Similar to the buddy allocator
> > in memory, we can use a buddy allocator system for the swap entries
> > to keep low order swap entries from fragmenting too much of the high
> > order swap entry space. It should greatly reduce the fragmentation
> > caused by allocating and freeing swap entries of different sizes.
> > However, the buddy allocator has its own limits as well. With system
> > memory, we can move and compact the memory. There is no rmap for
> > swap entries, so it is much harder to move a swap entry to another
> > disk location. So the buddy allocator for swap will help, but not
> > solve all the fragmentation issues.
> >
> > b) Large swap entries. Take a file as an example: a file on the file
> > system can be written to discontiguous disk locations. The file
> > system is responsible for tracking how to map the file offset into a
> > disk location. A large swap entry can have a similar indirection
> > array that maps out the disk location for the different subpages
> > within a folio. This allows a large folio to be written out to
> > discontiguous swap entries on the swap file. The array will need to
> > be stored somewhere as part of the overhead. When allocating swap
> > entries for the folio, we can allocate a batch of smaller 4k swap
> > entries into an array, and use this array to read/write the large
> > folio. There will be a lot of plumbing work to get it to work.
> >
> > Solution a) and b) can work together as well. Only use b) if not able
> > to allocate swap entries from a).
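
A toy model of combining a) and b) might look like the following. It
is a minimal sketch: SwapBuddy and alloc_for_folio are hypothetical
names, the free lists are plain Python sets, and there is no locking,
no per device state and no real IO. It only shows the buddy
split/merge logic and the fallback to a batch of order 0 slots.

```python
class SwapBuddy:
    """Toy buddy allocator over swap slot offsets (not kernel code)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free[order] holds start offsets of free, naturally aligned blocks.
        self.free = [set() for _ in range(max_order + 1)]
        self.free[max_order].add(0)

    def alloc(self, order):
        """Return the start offset of a free 2**order block, or None."""
        for o in range(order, self.max_order + 1):
            if self.free[o]:
                start = min(self.free[o])
                self.free[o].remove(start)
                # Split down, giving the upper halves back to the free lists.
                while o > order:
                    o -= 1
                    self.free[o].add(start + (1 << o))
                return start
        return None

    def free_block(self, start, order):
        """Free a block, merging with its buddy while the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free[order].add(start)


def alloc_for_folio(buddy, order):
    """Try a) one contiguous block, else b) a batch of order-0 slots."""
    start = buddy.alloc(order)
    if start is not None:
        return [start + i for i in range(1 << order)]
    slots = []
    for _ in range(1 << order):
        slot = buddy.alloc(0)
        if slot is None:
            # Not enough free slots: undo the partial batch and give up.
            for got in slots:
                buddy.free_block(got, 0)
            return None
        slots.append(slot)
    return slots
```

For example, with 8 slots fully allocated as order 0 and then every
odd slot freed, no order 1 block can be formed because each buddy is
still in use. In that state alloc_for_folio(buddy, 1) falls back to b)
and returns the two scattered slots 1 and 3.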
>
> Despite the limitation, I think a) is a better approach.
> Non-sequential read/write is very performance unfriendly even for
> ZRAM, so it will be better if the data is contiguous in both RAM and
> SWAP.

ZRAM might be able to extend the interface to receive larger swap
entry writes directly. The physical offset actually doesn't make much
difference to ZRAM because everything is virtual anyway.

>
> And if not, something like VMA readahead can already help improve
> performance. But we have seen this have a negative impact with fast
> devices like ZRAM so it's disabled in the current kernel.
>
> Migration of swap entries is a good thing to have, but the migration
> cost seems too high... I don't have a better idea on this.

That would likely require a backend to read the data from one device
into the swap cache and write it to the next level device. It is
similar to the zswap writeback, but reading from one device and
writing to another.

Chris