On Mon, Mar 4, 2024 at 2:04 PM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
>
> On Mon, Mar 4, 2024 at 10:44 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > On Fri, Mar 1, 2024 at 5:27 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > In last year's LSF/MM I talked about a VFS-like swap system. That is
> > > the pony that was chosen.
> > > However, I did not have much chance to go into details.
> > >
> > > This year, I would like to discuss what it takes to re-architect the
> > > whole swap back end from scratch.
> >
> > Very interesting topic! I have been stepping into many pitfalls and
> > existing issues of SWAP recently, and things are complicated;
> > it definitely needs more attention.
> >
> > >
> > > Let’s start from the requirements for the swap back end.
> > >
> > > 1) support the existing swap usage (not the implementation).
> > >
> > > Some other design goals:
> > >
> > > 2) low per swap entry memory usage.
> > >
> > > 3) low io latency.
> > >
> > > What are the functions the swap system needs to support?
> > >
> > > At the device level, the swap system needs to support a list of swap
> > > files with a priority order. Swap devices of the same priority are
> > > written in round-robin order. The swap device types include zswap,
> > > zram, SSD, spinning hard disk, and a swap file in a file system.
> > >
> > > At the swap entry level, here is the list of existing swap entry usage:
> > >
> > > * Swap entry allocation and free. Each swap entry needs to be
> > > associated with a location of the disk space in the swapfile (offset
> > > of swap entry).
> > > * Each swap entry needs to track the map count of the entry. (swap_map)
> > > * Each swap entry needs to be able to find the associated memory
> > > cgroup. (swap_cgroup_ctrl->map)
> > > * Swap cache. Look up the folio/shadow from the swap entry.
> > > * Swap page writes through a swapfile in a file system rather than a
> > > block device. (swap_extent)
> > > * Shadow entry. (stored in the swap cache)
> > >
> > > Any new swap back end might have a different internal implementation,
> > > but needs to support the above usage. For example, using an existing
> > > file system as the swap backend, with a per-vma or per-swap-entry
> > > mapping to a file, would mean it needs an additional data structure to
> > > track the swap_cgroup_ctrl, combined with the size of the file inode.
> > > It would be challenging to meet design goals 2) and 3) using another
> > > file system as it is.
> > >
> > > I am considering grouping the different swap entry data into one single
> > > struct and allocating it dynamically, so there is no upfront allocation
> > > of swap_map.
> >
> > Just some modest ideas about this ...
> >
> > Besides the usage, I noticed we currently already have the following
> > metadata reserved for SWAP:
> > SWAP map (Array of char)
> > SWAP shadow (XArray of pointer/long)
> > SWAP cgroup map (Array of short)
> > And ZSWAP has its own data.
> > Also the folio->private (SWAP entry)
> > PTE (SWAP entry)
> >
> > Maybe something new can combine and make better use of these, and also
> > reduce redundancy. E.g. the SWAP shadow (assuming it's not shrunk)
> > already contains cgroup info; a folio in the swap cache has its ->private
> > pointing to the SWAP entry while mapping/index are all empty. These may
> > indicate some space for smarter usage.
> >
> > One easy approach might be making better use of the current swap cache
> > xarray. We can never skip it even for direct swap in path (SYNC_IO),
> > I'm working on it (not for a whole new swap abstraction, just trying
> > to resolve some other issue and optimize things) and so far it seems
> > OK. With some optimizations performance is even better than before, as
> > we are already doing lookup and shadow cleaning in the current kernel.
> >
> > And considering that XArray is capable of storing ranged data with a
> > power-of-two size, this gives us a nice tool to store grouped swap
> > metadata for folios and reduce memory overhead.
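
For reference, the "one single struct, allocated dynamically" idea quoted above could look roughly like the userspace sketch below. This is only an illustration under stated assumptions, not kernel code: struct swap_entry_meta, struct swap_meta_table and the swap_meta_*() helpers are hypothetical names, and the flat pointer array stands in for whatever sparse index (for example the swap cache XArray mentioned above) a real design would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-entry record: one struct instead of parallel arrays. */
struct swap_entry_meta {
	uintptr_t cache;	/* folio pointer or shadow value (swap cache today) */
	uint16_t cgroup_id;	/* memory cgroup id (swap cgroup map today) */
	uint8_t map_count;	/* number of references (swap_map today) */
};

/* Hypothetical per-device table; a slot stays NULL until its offset is used. */
struct swap_meta_table {
	size_t nr_slots;
	struct swap_entry_meta **slots;
};

static int swap_meta_table_init(struct swap_meta_table *t, size_t nr_slots)
{
	t->slots = calloc(nr_slots, sizeof(*t->slots));
	t->nr_slots = t->slots ? nr_slots : 0;
	return t->slots ? 0 : -1;
}

/* Take a reference on an offset, allocating its record on first use. */
static struct swap_entry_meta *swap_meta_get(struct swap_meta_table *t,
					     size_t off, uint16_t cgroup_id)
{
	assert(off < t->nr_slots);
	if (!t->slots[off]) {
		t->slots[off] = calloc(1, sizeof(struct swap_entry_meta));
		if (!t->slots[off])
			return NULL;
		t->slots[off]->cgroup_id = cgroup_id;
	}
	t->slots[off]->map_count++;
	return t->slots[off];
}

/* Drop a reference; free the record when the last user goes away. */
static void swap_meta_put(struct swap_meta_table *t, size_t off)
{
	assert(off < t->nr_slots && t->slots[off]);
	if (--t->slots[off]->map_count == 0) {
		free(t->slots[off]);
		t->slots[off] = NULL;
	}
}
```

The point of the sketch is only that the map count, cgroup id and cache/shadow value live and die together with the entry, so unused offsets cost no per-entry metadata beyond the index itself.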
> >
> > Following this idea we may be able to have a smoother, progressive
> > transition to a better design of SWAP (e.g. start with storing more
> > complex things other than folio/shadow, then make it more
> > backend-specific, add features bit by bit). It is less likely to
> > break things, and we can test the stability and performance step by
> > step.
> >
> > > For swap entry allocation, the current kernel supports swapping out
> > > order-0 or PMD-order pages.
> > >
> > > There are some discussions and patches that add swap out for folio
> > > sizes in between (mTHP)
> > >
> > > https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@xxxxxxx/
> > >
> > > and swap in for mTHP:
> > >
> > > https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@xxxxxxxxx/
> > >
> > > The introduction of swapping different orders of pages will further
> > > complicate the swap entry fragmentation issue. The swap back end has
> > > no way to predict the life cycle of the swap entries. Repeatedly
> > > allocating and freeing swap entries of different sizes will fragment
> > > the swap entry array. If we can’t allocate contiguous swap entries
> > > for an mTHP, we will have to split the mTHP to a smaller size to
> > > perform the swap in and out.
> > >
> > > Current swap only supports 4K pages or PMD-size pages. Adding the
> > > other in-between sizes greatly increases the chance of fragmenting
> > > the swap entry space. When there are no more contiguous swap entries
> > > for an mTHP, the mTHP is forced to split into 4K pages. If we don’t
> > > solve the fragmentation issue, it will be a constant source of mTHP
> > > splitting.
> > >
> > > Another limitation I would like to address is that swap_writepage can
> > > only write out IO in one contiguous chunk; it cannot perform
> > > discontiguous IO. When the swapfile is close to full, it is likely
> > > that the unused entries will be spread across different locations. It
> > > would be nice to be able to read and write a large folio using
> > > discontiguous disk IO locations.
> > >
> > > Some possible ideas for the fragmentation issue.
> > >
> > > a) Buddy allocator for swap entries. Similar to the buddy allocator
> > > in memory, we can use a buddy allocator system for the swap entries to
> > > keep low-order swap entries from fragmenting too much of the
> > > high-order swap entry space. It should greatly reduce the
> > > fragmentation caused by allocating and freeing swap entries of
> > > different sizes. However, the buddy allocator has its own limits as
> > > well. With system memory, we can move and compact the memory. There is
> > > no rmap for swap entries, so it is much harder to move a swap entry to
> > > another disk location. So the buddy allocator for swap will help, but
> > > will not solve all the fragmentation issues.
> > >
> > > b) Large swap entries. Take a file as an example: a file on the file
> > > system can be written to discontiguous disk locations. The file system
> > > is responsible for tracking how to map the file offset into a disk
> > > location. A large swap entry can have a similar indirection array that
> > > maps out the disk location for the different subpages within a folio.
> > > This allows a large folio to be written out to discontiguous swap
> > > entries on the swap file. The array will need to be stored somewhere
> > > as part of the overhead. When allocating swap entries for the folio,
> > > we can allocate a batch of smaller 4K swap entries into an array and
> > > use this array to read/write the large folio. There will be a lot of
> > > plumbing work to get it to work.
> > >
> > > Solutions a) and b) can work together as well. Only use b) if not able
> > > to allocate swap entries from a).
> >
> > Despite the limitation, I think a) is a better approach. Non-sequential
> > read/write is very performance unfriendly even for ZRAM, so it will be
> > better if the data is contiguous in both RAM and SWAP.
>
> Why is it so unfriendly with ZRAM?

Because ZRAM currently operates as a block device backend. A
discontiguous write may be treated as many small IOs, so ZRAM cannot
compress over a larger buffer. It is also possible to modify the ZRAM
API to accept such IO vector writes differently.

>
> I'm surprised to hear that. Even with NVMe SSDs (controversial take
> here) the penalty for non-sequential writes, if batched, is not
> necessarily significant; you need other factors to be in play in the
> drive state/usage.
>
> > And if not, something like VMA readahead can already help improve
> > performance. But we have seen this have a negative impact with fast
> > devices like ZRAM so it's disabled in the current kernel.
> >
> > Migration of swap entries is a good thing to have, but the migration
> > cost seems too high... I don't have a better idea on this.
>
> One of my issues with the arch IIRC is that though it's called
> swap_type/swap_offset in the PTE, it's functionally
> swap_partition/swap_offset. The consequence being there is no
> practical way to, for example, migrate swapped pages from one swap
> backend to another. Instead we awkwardly do these sorts of things
> inside the backend.

You will need to track an extra data structure for writing from one
swap device to another, e.g. the source and destination swap entries.
That certainly requires additional data structures that do not exist in
the current swap back end.

>
> I need to look at the swap cache xarray (pointer to where to start
> would be welcome). Would it be feasible to enable a redirection
> there?

You can take a look at add_to_swap_cache(). Look at its internals and
its callers. That should get you into the swap cache internals.

Chris
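
As a footnote to ideas a) and b) in the quoted proposal, the "try contiguous first, fall back to an indirection array" combination can be sketched in userspace C as below. Everything here is an assumption made for illustration: NR_SLOTS, MAX_ORDER, struct large_swap_entry and the helper functions are hypothetical, and the contiguous path is a plain scan over naturally aligned power-of-two runs standing in for a real buddy allocator (it has no per-order free lists and no coalescing).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 16	/* tiny hypothetical swap device */
#define MAX_ORDER 4	/* largest folio covers 1 << MAX_ORDER slots */

static bool slot_used[NR_SLOTS];

/* Hypothetical "large swap entry": one slot offset per subpage. */
struct large_swap_entry {
	unsigned int order;
	bool contiguous;
	size_t slot[1 << MAX_ORDER];
};

/* a) look for a free, naturally aligned run of 1 << order slots */
static bool alloc_contig(unsigned int order, struct large_swap_entry *e)
{
	size_t n = (size_t)1 << order;

	for (size_t base = 0; base + n <= NR_SLOTS; base += n) {
		size_t i = 0;

		while (i < n && !slot_used[base + i])
			i++;
		if (i < n)
			continue;	/* some slot in this run is busy */
		for (i = 0; i < n; i++) {
			slot_used[base + i] = true;
			e->slot[i] = base + i;
		}
		return true;
	}
	return false;
}

/* b) gather any free slots, contiguous or not, into the array */
static bool alloc_scattered(unsigned int order, struct large_swap_entry *e)
{
	size_t n = (size_t)1 << order, found = 0;

	for (size_t s = 0; s < NR_SLOTS && found < n; s++)
		if (!slot_used[s])
			e->slot[found++] = s;
	if (found < n)
		return false;	/* not enough free slots; nothing was claimed */
	for (size_t i = 0; i < n; i++)
		slot_used[e->slot[i]] = true;
	return true;
}

/* Only use b) when a) cannot find a contiguous run. */
static bool swap_alloc(unsigned int order, struct large_swap_entry *e)
{
	assert(order <= MAX_ORDER);
	e->order = order;
	e->contiguous = alloc_contig(order, e);
	return e->contiguous || alloc_scattered(order, e);
}

static void swap_free(const struct large_swap_entry *e)
{
	for (size_t i = 0; i < ((size_t)1 << e->order); i++)
		slot_used[e->slot[i]] = false;
}
```

With every fourth slot busy, an order-1 request still gets an aligned contiguous pair, while an order-2 request falls back to the scattered path and records four separate offsets. The cost the proposal mentions is visible here: the slot[] array has to live somewhere, and each scattered subpage is its own IO location.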