On Fri, Mar 1, 2024 at 10:53 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Fri, Mar 1, 2024 at 4:24 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > In last year's LSF/MM I talked about a VFS-like swap system. That
> > is the pony that was chosen. However, I did not have much chance
> > to go into details.
>
> I'd love to attend this talk/chat :)
>
> > This year, I would like to discuss what it takes to re-architect
> > the whole swap back end from scratch.
> >
> > Let's start with the requirements for the swap back end.
> >
> > 1) Support the existing swap usage (not the implementation).
> >
> > Some other design goals:
> >
> > 2) Low per-swap-entry memory usage.
> >
> > 3) Low IO latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level, the swap system needs to support a list of
> > swap files with a priority order. Swap devices of the same
> > priority do round-robin writes across those devices. Swap device
> > types include zswap, zram, SSD, spinning hard disk, and a swap
> > file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry
> > usage:
> >
> > * Swap entry allocation and free. Each swap entry needs to be
> >   associated with a location of the disk space in the swapfile
> >   (the offset of the swap entry).
> > * Each swap entry needs to track the map count of the entry.
> >   (swap_map)
> > * Each swap entry needs to be able to find the associated memory
> >   cgroup. (swap_cgroup_ctrl->map)
> > * Swap cache: look up the folio/shadow from a swap entry.
> > * Swap page writes through a swapfile in a file system other than
> >   a block device. (swap_extent)
> > * Shadow entries. (stored in the swap cache)
>
> IMHO, one thing this new abstraction should support is seamless
> transfer/migration of pages from one backend to another (perhaps
> from high- to low-priority backends, i.e. writeback).
>
> I think this will require some careful redesign. The closest thing
> we have right now is zswap -> backing swapfile, but it is currently
> handled in a rather peculiar manner - the underlying swap slot has
> already been reserved for the zswap entry. There are a couple of
> problems with this:
>
> a) This is wasteful. We essentially have the same piece of data
>    occupying space at two levels of the hierarchy.
> b) How do we generalize this to a multi-tier hierarchy?
> c) This is a bit too backend-specific. It'd be nice if we could
>    make this as backend-agnostic as possible (if possible).
>
> Motivation: I'm currently working/thinking about decoupling zswap
> and swap, and this is one of the more challenging aspects (I can't
> seem to find a precedent in the swap world for inter-backend page
> migration), especially with respect to concurrent loads (and
> swapcache interactions).
>
> I don't have good answers/designs quite yet - just raising some
> questions/concerns :)
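To make the backend-agnostic direction a bit more concrete, one way
to picture it is a per-tier ops table, so that writeback becomes
"store into the lower tier, then free the entry in this tier" without
either tier knowing the other's type. This is only a sketch; none of
the names below exist in the kernel today:

struct swap_tier;

struct swap_tier_ops {
	/* allocate a slot in this tier and write the folio out */
	int  (*store)(struct swap_tier *tier, struct folio *folio,
		      swp_entry_t *entry);
	/* read the folio back from this tier */
	int  (*load)(struct swap_tier *tier, struct folio *folio,
		     swp_entry_t entry);
	/* release the slot backing this entry */
	void (*free)(struct swap_tier *tier, swp_entry_t entry);
};

struct swap_tier {
	const struct swap_tier_ops *ops;
	struct swap_tier *lower;	/* next (lower-priority) tier */
	int prio;
};

With something like this, the zswap -> swapfile path would not need
to pre-reserve a slot in the lower tier; the slot would be allocated
by ->store() at writeback time.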
I actually have one more problem here. To swap in a large folio, say
one with 16 subpages, it could be that 5 subpages are in zswap and 11
are in the backing swap device. We have no way to differentiate these
cases unless we iterate over the subpages of the large folio one by
one before calling zswap_load(). Right now, swap_read_folio() can't
handle this:

void swap_read_folio(struct folio *folio, bool synchronous,
		struct swap_iocb **plug)
{
	...
	if (zswap_load(folio)) {
		folio_mark_uptodate(folio);
		folio_unlock(folio);
	} else if (data_race(sis->flags & SWP_FS_OPS)) {
		swap_read_folio_fs(folio, plug);
	} else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) {
		swap_read_folio_bdev_sync(folio, sis);
	} else {
		swap_read_folio_bdev_async(folio, sis);
	}
	...
}

Thanks
Barry
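PS: the kind of per-subpage check I mean would look roughly like the
sketch below. Note that zswap_present() is made up for illustration;
zswap has no such lookup API today:

/*
 * Hypothetical: report whether zswap holds every subpage of a
 * swapped-out large folio, so the caller can pick a single read path.
 */
static bool swap_folio_fully_in_zswap(struct folio *folio)
{
	swp_entry_t entry = folio->swap;
	pgoff_t offset = swp_offset(entry);
	long i, hits = 0;

	for (i = 0; i < folio_nr_pages(folio); i++)
		if (zswap_present(swp_entry(swp_type(entry), offset + i)))
			hits++;

	/* A partial hit (e.g. 5 of 16) fits neither read path today. */
	WARN_ON_ONCE(hits && hits != folio_nr_pages(folio));
	return hits == folio_nr_pages(folio);
}

swap_read_folio() could call something like this before zswap_load()
and fall back to splitting the folio (or issuing per-subpage reads)
on a partial hit.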