On Wed, Feb 05, 2025 at 12:46:26AM +0800, Kairui Song wrote:
> On Wed, Feb 5, 2025 at 12:24 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > Hi Kairui,
> >
> > On Tue, Feb 04, 2025 at 07:44:46PM +0800, Kairui Song wrote:
> > > Hi all, sorry for the late submission.
> > >
> > > Following previous work and topics on the SWAP allocator
> > > [1][2][3][4], this topic proposes a way to redesign and integrate
> > > multiple kinds of swap data into the swap allocator, which should
> > > be a future-proof design, achieving the following benefits:
> > > - Even lower memory usage than the current design
> > > - Higher performance (remove the HAS_CACHE pin trampoline)
> > > - Dynamic allocation and growth support, further reducing idle
> > >   memory usage
> > > - Unifying the swapin path for a more maintainable code base
> > >   (remove SYNC_IO)
> > > - More extensible, providing a clean bedrock for implementing
> > >   things like discontinuous swapout, readahead-based mTHP swapin
> > >   and more.
> > >
> > > People have been complaining about the SWAP management subsystem
> > > [5]. Many incremental workarounds and optimizations have been
> > > added, but they cause other problems, e.g. [6][7][8][9], and make
> > > implementing new features more difficult. One reason is that the
> > > current design has nearly minimal memory usage (a 1-byte swap
> > > map) with acceptable performance, so it's hard to beat with
> > > incremental changes. But as more code and features are added,
> > > there are already lots of duplicated parts. So I'm proposing this
> > > idea to overhaul the whole SWAP slot management from a different
> > > angle, following up on the work on the SWAP allocator [2].
> > >
> > > Chris's topic "Swap abstraction" at LSFMM 2024 [1] raised the
> > > idea of unifying swap data; we worked together to implement the
> > > short-term solution first: the swap allocator was the bottleneck
> > > for performance and fragmentation issues. The new cluster
> > > allocator solved these issues and turned the cluster into a basic
> > > swap management unit. It also removed the slot cache freeing
> > > path, and I'll post another series soon to remove the slot cache
> > > allocation path, so folios will always interact with the SWAP
> > > allocator directly, preparing for this long-term goal:
> > >
> > > A brief intro of the new design
> > > ===============================
> > >
> > > It will first be a drop-in replacement for the swap cache, using
> > > a per-cluster table to handle all things required for SWAP
> > > management. Compared to the previous attempt to unify the swap
> > > cache [11], this will have lower overhead with more features
> > > achievable:
> > >
> > > struct swap_cluster_info {
> > >         spinlock_t lock;
> > >         u16 count;
> > >         u8 flags;
> > >         u8 order;
> > > +       void *table; /* 512 entries */
> > >         struct list_head list;
> > > };
> > >
> > > The table itself can have variant formats, but for basic usage,
> > > each void * could be one of the following types:
> > >
> > > /*
> > >  * a NULL:    | ------------ 0 -------------|     - Empty slot
> > >  * a Shadow:  | SWAP_COUNT |---- Shadow ----|XX1| - Swapped out
> > >  * a PFN:     | SWAP_COUNT |------ PFN -----|X10| - Cached
> > >  * a Pointer: |----------- Pointer ---------|100| - Reserved / Unused yet
> > >  * SWAP_COUNT is still 8 bits.
> > >  */
> > >
> > > Clearly it can hold both the cache and the swap count. The shadow
> > > still has enough bits for the refault distance (using 16M buckets
> > > for a 52-bit VA) or gen counting. For COUNT_CONTINUED, it can
> > > simply allocate another 512 atomics for one cluster.
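
Just to check my reading of the format above: with the low bits as a
type tag, I'd expect the entry tests to decode roughly like this. A
sketch only; the swp_tb_* helper names and exact bit positions are
mine, not from the proposal:

#include <stdbool.h>
#include <stdint.h>

/* Low bits of the 8-byte entry act as the type tag, since HAS_CACHE
 * and COUNT_CONTINUED no longer need dedicated bits. */
static inline bool swp_tb_is_empty(uint64_t tb)
{
        return tb == 0;                 /* NULL: free slot */
}

static inline bool swp_tb_is_shadow(uint64_t tb)
{
        return tb & 0x1;                /* XX1: swapped out, shadow */
}

static inline bool swp_tb_is_pfn(uint64_t tb)
{
        return (tb & 0x3) == 0x2;       /* X10: cached, stored as PFN */
}

static inline bool swp_tb_is_pointer(uint64_t tb)
{
        return (tb & 0x7) == 0x4;       /* 100: aligned pointer types */
}

/* For shadow and PFN entries, the top 8 bits carry SWAP_COUNT. */
static inline uint8_t swp_tb_count(uint64_t tb)
{
        return tb >> 56;
}
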
> > > The table is protected by ci->lock, which has little to no
> > > contention. It also gets rid of the "HAS_CACHE bit setting vs.
> > > cache insert" and "HAS_CACHE pin as trampoline" issues,
> > > deprecating SWP_SYNCHRONOUS_IO. And it removes the "multiple
> > > smaller files in one big swapfile" design.
> > >
> > > It will further remove the swap cgroup map. A cached folio
> > > (stored as a PFN) or the shadow can provide such info. Some
> > > careful audit and workflow redesign might be needed.
> > >
> > > Each entry will be 8 bytes, smaller than the current (8-byte
> > > cache) + (2-byte cgroup map) + (1-byte swap map) = 11 bytes.
> > >
> > > Shadow reclaim and high-order storing are still doable too, by
> > > introducing dense cluster table formats. We can even optimize it
> > > specially for shmem to have 1 bit per entry. And empty clusters
> > > can have their table freed. This part might be optional.
> > >
> > > And it can have more types for supporting things like entry
> > > migration or a virtual swapfile. The example formats above showed
> > > four types. The last three or more bits can be used as a type
> > > indicator, as HAS_CACHE and COUNT_CONTINUED will be gone.
>
> Hi Johannes,
>
> > My understanding is that this would still tie the swap space to
> > configured swapfiles. That aspect of the current design has more
> > and more turned into a problem, because we now have several
> > categories of swap entries that either permanently or for extended
> > periods of time live in memory. Such entries should not occupy
> > actual disk space.
> >
> > The oldest one is probably partially refaulted entries (where one
> > out of N swapped page tables faults back in). We currently have to
> > spend full pages of both memory AND disk space for these.
> >
> > The newest ones are zero-filled entries, which are stored in a
> > bitmap.
> >
> > Then there is zswap. You mention ghost swapfiles - I know some
> > setups do this to use zswap purely for compression. But zswap is
> > primarily a writeback cache for real swapfiles, and it is used as
> > such. That means entries need to be able to move from the
> > compressed pool to disk at some point, but might not for a long
> > time. Tying the compressed pool size to disk space is hugely
> > wasteful and an operational headache.
> >
> > So I think any future-proof design for the swap allocator needs to
> > decouple the virtual memory layer (page table count, swapcache,
> > memcg linkage, shadow info) from the physical layer (swapfile
> > slot).
> >
> > Can you touch on that concern?
>
> Yes, I fully understand your concern. The purpose of this swap table
> design is to provide a base for building other parts, including
> decoupling the virtual layer from the physical layer.
>
> The table entry can have different types, so a virtual file/space
> can leverage this too. For example, the virtual layer can have
> something like a "redirection entry" pointing to the physical device
> layer, or just a pointer to anything that could possibly be used (in
> the four examples I provided, one type is a pointer). A swap space
> will need something to index its data.
>
> We have already internally deployed a very similar solution for
> multi-layer swapout, and it's working well; we expect to implement
> it upstream and deprecate the downstream solution.
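
The "redirection entry" is the part I'd like to see spelled out,
because it's where the virtual/physical split actually happens. Here
is how I picture a pointer-type entry decoupling the two layers - an
entirely hypothetical structure, just to make the discussion concrete:

#include <stdint.h>

/* What can currently back a virtual swap entry. */
enum swap_backing_type {
        SWAP_BACKING_DISK,      /* a slot in a real swapfile */
        SWAP_BACKING_ZSWAP,     /* compressed pool, no disk slot yet */
        SWAP_BACKING_ZERO,      /* zero-filled, needs no storage */
};

/* Referenced by a pointer-type entry in the virtual layer's table;
 * an entry can move between backings without the virtual layer
 * (page tables, swap cache, memcg linkage) ever noticing. */
struct swap_redirect {
        enum swap_backing_type type;
        union {
                uint64_t disk_slot;     /* offset in the device layer */
                void *zswap_handle;     /* handle into the pool */
        };
};
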
> Using an optional layer for doing so still consumes very little
> memory (16 bytes per entry for two layers, and this might be doable
> with just a single layer). And there are setups that don't need an
> extra layer; such setups can ignore that part and have only 8 bytes
> per entry, keeping the overhead very low.

IIUC with this design we still have a fixed-size swap space, but it's
not directly tied to the physical swap layer (i.e. it can be backed by
a swap slot on disk, zswap, zero-filled pages, etc). Did I get this
right?

In this case, using clusters to manage this should be an
implementation detail that is not visible to userspace. Ideally the
kernel would allocate more clusters dynamically as needed, and when a
swap entry is being allocated in a cluster the kernel chooses the
backing for that swap entry based on the available options.

I see the benefit of managing things at the cluster level to reduce
memory overhead (e.g. one lock per cluster vs. per entry), and to
leverage existing code where it makes sense. However, what we should
*not* do is have these clusters be tied to the disk swap space with
the ability to redirect some entries to use something like zswap. This
does not fix the problem Johannes is describing.
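
To make that concrete, the allocation flow I'd hope for looks
something like the sketch below, where the cluster is purely a
virtual-space management unit and the backing decision is deferred
until writeout. All of the names here are made up for illustration:

/* Hypothetical allocation path: note that no swapfile size or
 * device offset appears anywhere at allocation time. */
swp_entry_t alloc_virtual_swap_entry(void)
{
        struct swap_cluster_info *ci;
        swp_entry_t entry;

        /* Grow the virtual space on demand; invisible to userspace. */
        ci = get_or_alloc_virtual_cluster();

        /* Take a free slot in this cluster's table, under ci->lock. */
        entry = cluster_reserve_slot(ci);

        /*
         * Deliberately no backing choice here: disk vs. zswap vs. the
         * zero bitmap is picked per entry later, at writeout time.
         */
        return entry;
}
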