On Wed, Feb 5, 2025 at 2:11 AM Yosry Ahmed <yosry.ahmed@xxxxxxxxx> wrote:
>
> On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> > On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > >
> > > Hi Kairui,
> > >
> > > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > > Hi all, sorry for the late submission.
> > > >
> > > > Following previous work and topics with the SWAP allocator
> > > > [1][2][3][4], this topic would propose a way to redesign and integrate
> > > > multiple kinds of swap data into the swap allocator, which should be a
> > > > future-proof design, achieving the following benefits:
> > > > - Even lower memory usage than the current design
> > > > - Higher performance (Remove HAS_CACHE pin trampoline)
> > > > - Dynamic allocation and growth support, further reducing idle memory usage
> > > > - Unifying the swapin path for a more maintainable code base (Remove SYNC_IO)
> > > > - More extensible, providing a clean bedrock for implementing things
> > > > like discontinuous swapout, readahead-based mTHP swapin and more.
> > > >
> > > > People have been complaining about the SWAP management subsystem [5].
> > > > Many incremental workarounds and optimizations have been added, but
> > > > they cause many other problems, e.g. [6][7][8][9], and make
> > > > implementing new features more difficult. One reason is that the
> > > > current design already has close to minimal memory usage (a 1-byte
> > > > swap map) with acceptable performance, so it's hard to beat with
> > > > incremental changes. But as more code and features are added, there
> > > > are already lots of duplicated parts. So I'm proposing this idea to
> > > > overhaul the whole SWAP slot management from a different angle, as
> > > > follow-up work on the SWAP allocator [2].
> > > >
> > > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the idea of
> > > > unifying swap data; we worked together to implement the short-term
> > > > solution first: the swap allocator was the bottleneck for performance
> > > > and fragmentation issues. The new cluster allocator solved these
> > > > issues, and turned the cluster into a basic swap management unit.
> > > > It also removed the slot cache freeing path, and I'll post another
> > > > series soon to remove the slot cache allocation path, so folios will
> > > > always interact with the SWAP allocator directly, preparing for this
> > > > long-term goal:
> > > >
> > > > A brief intro of the new design
> > > > ===============================
> > > >
> > > > It will first be a drop-in replacement for the swap cache, using a
> > > > per-cluster table to handle all things required for SWAP management.
> > > > Compared to the previous attempt to unify the swap cache [11], this
> > > > will have lower overhead with more features achievable:
> > > >
> > > > struct swap_cluster_info {
> > > >         spinlock_t lock;
> > > >         u16 count;
> > > >         u8 flags;
> > > >         u8 order;
> > > > +       void *table; /* 512 entries */
> > > >         struct list_head list;
> > > > };
> > > >
> > > > The table itself can have variant formats, but for basic usage,
> > > > each void* could be one of the following types:
> > > >
> > > > /*
> > > >  * a NULL:    | ----------- 0 ------------|      - Empty slot
> > > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
> > > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > > >  * SWAP_COUNT is still 8 bits.
> > > >  */
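
[To make the layout above concrete: a rough userspace-style C sketch of
how such an 8-byte tagged entry could be encoded. Everything below is
illustrative only; the type and helper names are made up and not from any
posted patch. The only assumptions are the low-3-bit type tag and the
8-bit SWAP_COUNT field shown in the quoted comment.]

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t swp_te_t;              /* one 8-byte table entry */

#define SWP_TE_COUNT_SHIFT      56      /* top 8 bits: SWAP_COUNT */
#define SWP_TE_COUNT_MASK       (0xffULL << SWP_TE_COUNT_SHIFT)
#define SWP_TE_VAL_MASK         (~(SWP_TE_COUNT_MASK | 0x7ULL))

/* Low bits follow the quoted layout: ...XX1, ...X10, ...100. */
static inline bool swp_te_is_empty(swp_te_t te)  { return te == 0; }
static inline bool swp_te_is_shadow(swp_te_t te) { return te & 0x1; }
static inline bool swp_te_is_pfn(swp_te_t te)    { return (te & 0x3) == 0x2; }
static inline bool swp_te_is_ptr(swp_te_t te)    { return (te & 0x7) == 0x4; }

/* Swapped-out slot: 8-bit count + shadow bits (assumed to fit bits 3..55) + tag 1. */
static inline swp_te_t swp_te_mk_shadow(uint8_t count, uint64_t shadow)
{
	return ((uint64_t)count << SWP_TE_COUNT_SHIFT) |
	       (shadow & SWP_TE_VAL_MASK) | 0x1;
}

/* Cached slot: 8-bit count + PFN of the folio in swap cache + tag 10. */
static inline swp_te_t swp_te_mk_pfn(uint8_t count, uint64_t pfn)
{
	return ((uint64_t)count << SWP_TE_COUNT_SHIFT) |
	       ((pfn << 3) & SWP_TE_VAL_MASK) | 0x2;
}

/* Pointer slot: any 8-byte-aligned pointer, its free low 3 bits reused as tag 100. */
static inline swp_te_t swp_te_mk_ptr(void *p)
{
	return (uint64_t)(uintptr_t)p | 0x4;
}

static inline uint8_t swp_te_count(swp_te_t te)
{
	return te >> SWP_TE_COUNT_SHIFT;
}

static inline uint64_t swp_te_pfn(swp_te_t te)
{
	return (te & SWP_TE_VAL_MASK) >> 3;
}
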
> > > >
> > > > Clearly it can hold both the cache and the swap count. The shadow
> > > > still has enough bits for distance (using 16M buckets for a 52-bit
> > > > VA) or generation counting. For COUNT_CONTINUED, it can simply
> > > > allocate another 512 atomics for one cluster.
> > > >
> > > > The table is protected by ci->lock, which has little to no contention.
> > > > It also gets rid of the "HAS_CACHE bit setting vs cache insert" and
> > > > "HAS_CACHE pin as trampoline" issues, deprecating SWP_SYNCHRONOUS_IO.
> > > > And it removes the "multiple smaller files in one big swapfile" design.
> > > >
> > > > It will further remove the swap cgroup map. A cached folio (stored as
> > > > a PFN) or the shadow can provide such info. Some careful audit and
> > > > workflow redesign might be needed.
> > > >
> > > > Each entry will be 8 bytes, smaller than the current (8 bytes cache)
> > > > + (2 bytes cgroup map) + (1 byte SWAP map) = 11 bytes.
> > > >
> > > > Shadow reclaim and high-order storing are still doable too, by
> > > > introducing dense cluster table formats. We can even optimize it
> > > > specially for shmem to have 1 bit per entry. And empty clusters can
> > > > have their table freed. This part might be optional.
> > > >
> > > > And it can have more types for supporting things like entry migrations
> > > > or virtual swapfiles. The example formats above show four types. The
> > > > last three (or more) bits can be used as a type indicator, since
> > > > HAS_CACHE and COUNT_CONTINUED will be gone.
> > > >
> >
> > Hi Johannes
> >
> > > My understanding is that this would still tie the swap space to
> > > configured swapfiles. That aspect of the current design has more and
> > > more turned into a problem, because we now have several categories of
> > > swap entries that either permanently or for extended periods of time
> > > live in memory. Such entries should not occupy actual disk space.
> > >
> > > The oldest one is probably partially refaulted entries (where one out
> > > of N swapped page tables faults back in). We currently have to spend
> > > full pages of both memory AND disk space for these.
> > >
> > > The newest ones are zero-filled entries which are stored in a bitmap.
> > >
> > > Then there is zswap. You mention ghost swapfiles - I know some setups
> > > do this to use zswap purely for compression. But zswap is a writeback
> > > cache for real swapfiles primarily, and it is used as such. That means
> > > entries need to be able to move from the compressed pool to disk at
> > > some point, but might not for a long time. Tying the compressed pool
> > > size to disk space is hugely wasteful and an operational headache.
> > >
> > > So I think any future-proof design for the swap allocator needs to
> > > decouple the virtual memory layer (page table count, swapcache, memcg
> > > linkage, shadow info) from the physical layer (swapfile slot).
> > >
> > > Can you touch on that concern?
> >
> > Yes, I fully understand your concern. The purpose of this swap table
> > design is to provide a base for building other parts, including
> > decoupling the virtual layer from the physical layer.
> >
> > The table entry can have different types, so a virtual file/space can
> > leverage this too. For example the virtual layer can have something
> > like a "redirection entry" pointing to a physical device layer. Or
> > just a pointer to anything that could possibly be used (in the four
> > examples I provided, one type is a pointer). A swap space will need
> > something to index its data.
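
To make that last point concrete, here is a purely illustrative sketch
(userspace-style C, all names invented for the example and not from any
patch) of the kind of thing a pointer-type "redirection" entry could
reference, so a virtual slot stays decoupled from its physical backing:

#include <stdint.h>

enum swp_backing_type {
	SWP_BACKING_NONE,	/* not materialized anywhere yet */
	SWP_BACKING_DISK,	/* slot on a real swapfile/device */
	SWP_BACKING_ZSWAP,	/* compressed copy only, no disk slot */
	SWP_BACKING_ZERO,	/* known zero-filled, no storage at all */
};

/*
 * A virtual slot's pointer-type entry could reference something like
 * this. The struct must be at least 8-byte aligned so the low 3 bits
 * of its address stay free for the table's type tag.
 */
struct swp_backing_desc {
	enum swp_backing_type type;
	union {
		uint64_t disk_slot;	/* offset on the physical device */
		void *zswap_handle;	/* handle into the compressed pool */
	};
} __attribute__((aligned(8)));

With something like this, moving an entry from the compressed pool to
disk could just rewrite the descriptor under ci->lock, without touching
page tables or the virtual slot itself.
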
> > We have already deployed a very similar solution internally for
> > multi-layer swapout, and it's working well; we expect to implement it
> > upstream and deprecate the downstream solution.
> >
> > Using an optional layer for doing so still consumes very little memory
> > (16 bytes per entry for two layers, and this might be doable with just
> > a single layer). And there are setups that don't need an extra layer;
> > such setups can ignore that part and have only 8 bytes per entry,
> > keeping the overhead very low.
>
> IIUC with this design we still have a fixed-size swap space, but it's
> not directly tied to the physical swap layer (i.e. it can be backed with
> a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
> right?
>
> In this case, using clusters to manage this should be an implementation
> detail that is not visible to userspace. Ideally the kernel would
> allocate more clusters dynamically as needed, and when a swap entry is
> being allocated in that cluster the kernel chooses the backing for that
> swap entry based on the available options.
>
> I see the benefit of managing things on the cluster level to reduce
> memory overhead (e.g. one lock per cluster vs. per entry), and to
> leverage existing code where it makes sense.

Yes, agreed, a cluster-based map means we can have many empty clusters
without consuming any pre-reserved map memory. And extending the cluster
array should be doable too.

>
> However, what we should *not* do is have these clusters be tied to the
> disk swap space with the ability to redirect some entries to use
> something like zswap. This does not fix the problem Johannes is
> describing.

Yes, a virtual swap file can have its own swap space, which is indexed
by the cache / table, and reuses all the logic.

As long as we don't dramatically change the kernel swapout path, adding
a folio to the swap cache seems a very reasonable way to avoid redundant
IO and synchronize swapin/swapout, while reusing a lot of infrastructure,
even if that's a virtual file. For example, a current busy-loop issue can
be fixed just by leveraging the folio lock:
https://lore.kernel.org/lkml/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@xxxxxxxxxxxxxx/

The virtual file/space can be decoupled from the lower device. But the
virtual file/space's table entry can point to an underlying physical
SWAP device or some meta struct.
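
To close with a rough illustration of the cost side (a standalone sketch
with made-up names, using plain userspace C instead of kernel
primitives): the per-cluster table only exists while the cluster is in
use, which is where the dynamic growth and "empty clusters cost nothing"
properties come from. With 4 KiB pages, 512 entries x 8 bytes is 4 KiB of
table per 2 MiB of swap space covered by a cluster, versus roughly 11
bytes per entry today.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CLUSTER_SLOTS 512	/* 512 entries x 8 bytes = 4 KiB per cluster */

struct cluster_sketch {
	uint16_t count;		/* in-use slots, like swap_cluster_info.count */
	uint64_t *table;	/* allocated lazily, NULL while cluster is empty */
};

/* Store a non-zero entry, allocating the table on first use. */
static int cluster_store(struct cluster_sketch *ci, int slot, uint64_t te)
{
	if (!te || slot < 0 || slot >= CLUSTER_SLOTS)
		return -1;
	if (!ci->table) {
		ci->table = calloc(CLUSTER_SLOTS, sizeof(*ci->table));
		if (!ci->table)
			return -1;
	}
	if (!ci->table[slot])
		ci->count++;
	ci->table[slot] = te;
	return 0;
}

/* Clear an entry; once the cluster is empty, the whole table is freed. */
static void cluster_clear(struct cluster_sketch *ci, int slot)
{
	if (!ci->table || !ci->table[slot])
		return;
	ci->table[slot] = 0;
	if (--ci->count == 0) {
		free(ci->table);
		ci->table = NULL;
	}
}

int main(void)
{
	struct cluster_sketch ci = { 0 };

	cluster_store(&ci, 3, 0xabcd);	/* first store allocates the 4 KiB table */
	cluster_clear(&ci, 3);		/* last clear frees it again */
	printf("table after last free: %p\n", (void *)ci.table);
	return 0;
}
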