On Fri, Mar 1, 2024 at 4:24 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> In last year's LSF/MM I talked about a VFS-like swap system. That is
> the pony that was chosen.
> However, I did not have much chance to go into details.

I'd love to attend this talk/chat :)

>
> This year, I would like to discuss what it takes to re-architect the
> whole swap back end from scratch?
>
> Let’s start from the requirements for the swap back end.
>
> 1) support the existing swap usage (not the implementation).
>
> Some other design goals::
>
> 2) low per swap entry memory usage.
>
> 3) low io latency.
>
> What are the functions the swap system needs to support?
>
> At the device level. Swap systems need to support a list of swap files
> with a priority order. The same priority of swap device will do round
> robin writing on the swap device. The swap device type includes zswap,
> zram, SSD, spinning hard disk, swap file in a file system.
>
> At the swap entry level, here is the list of existing swap entry usage:
>
> * Swap entry allocation and free. Each swap entry needs to be
> associated with a location of the disk space in the swapfile. (offset
> of swap entry).
> * Each swap entry needs to track the map count of the entry. (swap_map)
> * Each swap entry needs to be able to find the associated memory
> cgroup. (swap_cgroup_ctrl->map)
> * Swap cache. Lookup folio/shadow from swap entry
> * Swap page writes through a swapfile in a file system other than a
> block device. (swap_extent)
> * Shadow entry. (store in swap cache)

IMHO, one thing this new abstraction should support is seamless
transfer/migration of pages from one backend to another (perhaps from
high to low priority backends, i.e. writeback).

I think this will require some careful redesigns. The closest thing we
have right now is zswap -> backing swapfile. But it is currently
handled in a rather peculiar manner - the underlying swap slot has
already been reserved for the zswap entry.
But there are a couple of problems with this:

a) This is wasteful. We essentially have the same piece of data
occupying space at two levels of the hierarchy.
b) How do we generalize to a multi-tier hierarchy?
c) This is a bit too backend-specific. It'd be nice if we could make
this as backend-agnostic as possible.

Motivation: I'm currently working on/thinking about decoupling zswap
and swap, and this is one of the more challenging aspects (as I can't
seem to find a precedent in the swap world for migrating pages between
swap backends), especially with respect to concurrent loads (and
swapcache interactions).

I don't have good answers/designs quite yet - just raising some
questions/concerns :)
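
To make points a) to c) a bit more concrete, here is a toy user-space sketch of one possible shape for the indirection: a backend-agnostic descriptor that records which backend currently holds the page, with the destination slot allocated only at migration time. This is purely illustrative and not an existing kernel interface. Every name in it (toy_backend, toy_swap_desc, toy_migrate, ...) is hypothetical, and it deliberately ignores the hard parts raised above: locking, concurrent loads, and swapcache interactions.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16	/* tiny "page" so the demo stays readable */
#define TOY_SLOTS 4

/* Hypothetical backend: a fixed pool of slots; lower tier = faster. */
struct toy_backend {
	const char *name;
	int tier;
	int used[TOY_SLOTS];
	char data[TOY_SLOTS][TOY_PAGE_SIZE];
};

/* Hypothetical backend-agnostic descriptor a swap entry would resolve to. */
struct toy_swap_desc {
	struct toy_backend *backend;
	int slot;
};

static int toy_alloc_slot(struct toy_backend *b)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		if (!b->used[i]) {
			b->used[i] = 1;
			return i;
		}
	}
	return -1;	/* backend full */
}

static void toy_free_slot(struct toy_backend *b, int slot)
{
	assert(slot >= 0 && slot < TOY_SLOTS);
	b->used[slot] = 0;
	memset(b->data[slot], 0, TOY_PAGE_SIZE);
}

static int toy_used_slots(const struct toy_backend *b)
{
	int n = 0;

	for (int i = 0; i < TOY_SLOTS; i++)
		n += b->used[i];
	return n;
}

/* Store a page in one backend only; nothing is reserved in lower tiers. */
static int toy_store(struct toy_swap_desc *d, struct toy_backend *b,
		     const char *page)
{
	int slot = toy_alloc_slot(b);

	if (slot < 0)
		return -1;
	memcpy(b->data[slot], page, TOY_PAGE_SIZE);
	d->backend = b;
	d->slot = slot;
	return 0;
}

/* Demote a page: allocate in dst, copy, repoint the descriptor, free src. */
static int toy_migrate(struct toy_swap_desc *d, struct toy_backend *dst)
{
	struct toy_backend *src = d->backend;
	int old = d->slot;
	int slot;

	if (!src || dst->tier <= src->tier)
		return -1;	/* this sketch only handles demotion */
	slot = toy_alloc_slot(dst);
	if (slot < 0)
		return -1;
	memcpy(dst->data[slot], src->data[old], TOY_PAGE_SIZE);
	d->backend = dst;
	d->slot = slot;
	toy_free_slot(src, old);
	return 0;
}

static const char *toy_load(const struct toy_swap_desc *d)
{
	return d->backend->data[d->slot];
}

/* Walk one page through store -> migrate -> load; returns 0 on success. */
static int toy_demo(void)
{
	struct toy_backend fast = { .name = "fast-tier", .tier = 0 };
	struct toy_backend slow = { .name = "slow-tier", .tier = 1 };
	struct toy_swap_desc desc = { .backend = NULL, .slot = -1 };
	char page[TOY_PAGE_SIZE] = "page contents";

	if (toy_store(&desc, &fast, page) != 0)
		return 1;
	if (toy_used_slots(&slow) != 0)
		return 2;	/* no slot reserved below while in the fast tier */
	if (toy_migrate(&desc, &slow) != 0)
		return 3;
	if (toy_used_slots(&fast) != 0 || desc.backend != &slow)
		return 4;	/* data lives in exactly one tier */
	if (memcmp(toy_load(&desc), page, TOY_PAGE_SIZE) != 0)
		return 5;
	return 0;
}
```

In this sketch the page occupies exactly one tier at any time, and adding a third tier only means adding another toy_backend. The obvious cost is the extra per-entry descriptor, which pulls against the "low per swap entry memory usage" goal quoted above.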