Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

On Fri, Mar 1, 2024 at 10:53 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Mar 1, 2024 at 4:24 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > In last year's LSF/MM I talked about a VFS-like swap system. That is
> > the pony that was chosen.
> > However, I did not have much chance to go into details.
>
> I'd love to attend this talk/chat :)
>
> >
> > This year, I would like to discuss what it would take to re-architect
> > the whole swap back end from scratch.
> >
> > Let’s start from the requirements for the swap back end.
> >
> > 1) support the existing swap usage (not the implementation).
> >
> > Some other design goals:
> >
> > 2) low per swap entry memory usage.
> >
> > 3) low io latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level, the swap system needs to support a list of swap
> > files with a priority order. Swap devices of the same priority are
> > written to in round-robin fashion. Swap device types include zswap,
> > zram, SSD, spinning hard disk, and a swap file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry usage:
> >
> > * Swap entry allocation and freeing. Each swap entry needs to be
> > associated with a location in the swapfile's disk space (the offset
> > of the swap entry).
> > * Each swap entry needs to track the map count of the entry. (swap_map)
> > * Each swap entry needs to be able to find the associated memory
> > cgroup. (swap_cgroup_ctrl->map)
> > * Swap cache. Lookup folio/shadow from swap entry
> > * Swap page writes through a swapfile that lives in a file system,
> > rather than on a raw block device. (swap_extent)
> > * Shadow entry. (store in swap cache)
>
> IMHO, one thing this new abstraction should support is seamless
> transfer/migration of pages from one backend to another (perhaps from
> high to low priority backends, i.e. writeback).
>
> I think this will require some careful redesigns. The closest thing we
> have right now is zswap -> backing swapfile. But it is currently
> handled in a rather peculiar manner - the underlying swap slot has
> already been reserved for the zswap entry. There are a couple of
> problems with this:
>
> a) This is wasteful. We essentially have the same piece of data
> occupying space at two levels of the hierarchy.
> b) How do we generalize to a multi-tier hierarchy?
> c) This is a bit too backend-specific. It'd be nice if we could make
> this as backend-agnostic as possible.
>
> Motivation: I'm currently working on/thinking about decoupling zswap
> and swap, and this is one of the more challenging aspects (as I can't
> seem to find a precedent in the swap world for page migration between
> swap backends), especially with respect to concurrent loads (and
> swapcache interactions).
>
> I don't have good answers/designs quite yet - just raising some
> questions/concerns :)
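
For reference on the per-entry state Chris lists above, here is a rough
sketch of where each piece lives today, paraphrased from memory of the
current code (field and helper names may not match the tree exactly);
the point is that it is spread over several parallel structures, all
keyed by (swap type, offset):

/*
 * Rough sketch only -- paraphrased, names may be off.
 */
struct swap_info_struct {
        ...
        unsigned char *swap_map;                /* per-entry usage/map count */
        struct swap_cluster_info *cluster_info; /* allocation bookkeeping */
        struct rb_root swap_extent_root;        /* offset -> on-disk blocks
                                                   for swapfiles */
        ...
};

/* memcg ownership: a per-type array of 2-byte ids, indexed by offset */
/* swap cache + shadow entries: per-type address_space xarrays, indexed
   by offset */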

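To make the backend-agnostic transfer point a bit more concrete, here is
a very rough, purely hypothetical sketch of the shape such an operation
could take. None of the ops or names below exist in the current tree; it
only illustrates moving an entry between tiers without keeping a slot
reserved in both at the same time:

/*
 * Hypothetical sketch, not existing kernel API. Each tier exposes the
 * same small set of operations, so "writeback" becomes a generic move
 * between two tiers instead of a zswap-specific path that pre-reserves
 * a slot in the lower tier.
 */
struct swap_tier_ops {
        int  (*store)(swp_entry_t entry, struct folio *folio);
        int  (*load)(swp_entry_t entry, struct folio *folio);
        void (*invalidate)(swp_entry_t entry);
};

static int swap_move_entry(swp_entry_t entry, struct folio *folio,
                           const struct swap_tier_ops *from,
                           const struct swap_tier_ops *to)
{
        int err;

        err = from->load(entry, folio);         /* e.g. decompress from zswap */
        if (err)
                return err;

        err = to->store(entry, folio);          /* e.g. write to an SSD swapfile */
        if (err)
                return err;

        /* only now give back the space in the upper tier */
        from->invalidate(entry);
        return 0;
}
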
I actually have one more problem here. To swap in a large folio with,
say, 16 subpages, it could be that 5 subpages are in zswap and 11 are
in the backing swap device. We have no way to differentiate this unless
we iterate over the subpages of the large folio one by one before
calling zswap_load(). Right now, swap_read_folio() can't handle this:

void swap_read_folio(struct folio *folio, bool synchronous,
                struct swap_iocb **plug)
{
       ...

        if (zswap_load(folio)) {
                folio_mark_uptodate(folio);
                folio_unlock(folio);
        } else if (data_race(sis->flags & SWP_FS_OPS)) {
                swap_read_folio_fs(folio, plug);
        } else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) {
                swap_read_folio_bdev_sync(folio, sis);
        } else {
                swap_read_folio_bdev_async(folio, sis);
        }
        ...
}
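
One possible direction, just to illustrate the check that seems to be
missing: walk the swap entries the folio covers and ask which backend
holds each one before picking a read path. zswap_entry_present() below
is a made-up helper (no such per-entry query is exported today);
folio->swap, swp_entry()/swp_type()/swp_offset() and folio_nr_pages()
are existing APIs:

/*
 * Hypothetical sketch only. zswap_entry_present() does not exist; it
 * stands in for "is this particular entry cached in zswap". The point
 * is that a large folio's entries may be split between zswap and the
 * backing device, so a single zswap_load(folio) call cannot cover it.
 */
static bool folio_swap_all_in_zswap(struct folio *folio)
{
        swp_entry_t entry = folio->swap;
        pgoff_t offset = swp_offset(entry);
        long i;

        for (i = 0; i < folio_nr_pages(folio); i++) {
                swp_entry_t sub = swp_entry(swp_type(entry), offset + i);

                if (!zswap_entry_present(sub))  /* hypothetical helper */
                        return false;
        }
        return true;
}

A real solution would probably want this handled (or avoided at
allocation/writeback time) inside the swap core rather than probed by
every caller, but it shows the per-subpage information we currently
cannot get.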

Thanks
Barry




