On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> 
> On 2024/3/1 17:24, Chris Li wrote:
> > In last year's LSF/MM I talked about a VFS-like swap system. That is
> > the pony that was chosen. However, I did not have much chance to go
> > into details.
> >
> > This year, I would like to discuss what it takes to re-architect the
> > whole swap back end from scratch.
> >
> > Let’s start from the requirements for the swap back end:
> >
> > 1) support the existing swap usage (not the implementation).
> >
> > Some other design goals:
> >
> > 2) low per swap entry memory usage.
> >
> > 3) low IO latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level, the swap system needs to support a list of swap
> > files with a priority order. Swap devices of the same priority are
> > written to in round-robin fashion. Swap device types include zswap,
> > zram, SSD, spinning hard disk, and a swap file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry usage:
> >
> > * Swap entry allocation and free. Each swap entry needs to be
> >   associated with a location of the disk space in the swapfile
> >   (the offset of the swap entry).
> > * Each swap entry needs to track the map count of the entry. (swap_map)
> > * Each swap entry needs to be able to find the associated memory
> >   cgroup. (swap_cgroup_ctrl->map)
> > * Swap cache: look up the folio/shadow from the swap entry.
> > * Swap page writes through a swapfile in a file system other than a
> >   block device. (swap_extent)
> > * Shadow entries. (stored in the swap cache)
> >
> > Any new swap back end might have a different internal implementation,
> > but it needs to support the above usage. For example, using an
> > existing file system as the swap backend, with a per-VMA or
> > per-swap-entry mapping to a file, would need additional data
> > structures to track swap_cgroup_ctrl; combined with the size of the
> > file inode, it would be challenging to meet design goals 2) and 3)
> > using another file system as it is.
> >
> > I am considering grouping the different swap entry data into one
> > single struct and allocating it dynamically, so there is no upfront
> > allocation of swap_map.
> >
> > For swap entry allocation: the current kernel supports swapping out
> > 0-order or PMD-order pages.
> >
> > There are some discussions and patches that add swap out for folio
> > sizes in between (mTHP):
> >
> > https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@xxxxxxx/
> >
> > and swap in for mTHP:
> >
> > https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@xxxxxxxxx/
> >
> > The introduction of swapping pages of different orders will further
> > complicate the swap entry fragmentation issue. The swap back end has
> > no way to predict the life cycle of the swap entries. Repeatedly
> > allocating and freeing swap entries of different sizes will fragment
> > the swap entry array. If we can’t allocate contiguous swap entries
> > for an mTHP, we will have to split the mTHP to a smaller size to
> > perform the swap in and out.
> >
> > Current swap only supports 4K pages or PMD-size pages. Adding the
> > other in-between sizes greatly increases the chance of fragmenting
> > the swap entry space. When there are no more contiguous swap entries
> > for an mTHP, the mTHP is forced to split into 4K pages. If we don’t
> > solve the fragmentation issue, it will be a constant source of mTHP
> > splits.
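As a purely illustrative aside on the "group the per-entry data into one
dynamically allocated struct" idea quoted above: a minimal sketch of such a
descriptor might look like the following. The struct and field names here
(swap_desc and friends) are hypothetical and are not existing kernel code;
the point is only that the state currently spread over swap_map, the swap
cgroup map and the swap cache could sit behind a single per-entry
allocation looked up by swap offset.

	/* Hypothetical sketch -- not existing kernel code. */
	#include <linux/mm_types.h>	/* swp_entry_t, struct folio */

	/*
	 * One dynamically allocated descriptor per in-use swap entry,
	 * found by a (swap type, offset) lookup instead of indexing a
	 * statically sized swap_map[] array allocated at swapon time.
	 */
	struct swap_desc {
		swp_entry_t	entry;		/* device + offset this describes */
		unsigned int	map_count;	/* today: the swap_map[] count */
		unsigned short	memcg_id;	/* today: swap_cgroup_ctrl->map */
		struct folio	*cache_folio;	/* swap cache folio, if present */
		void		*shadow;	/* workingset shadow when not cached */
	};

Whether something this small can actually meet design goal 2) depends on
the allocator and on the lookup structure used to go from a swap entry to
its descriptor.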
> > Another limitation I would like to address is that swap_writepage
> > can only write out IO in one contiguous chunk; it is not able to
> > perform non-contiguous IO. When the swapfile is close to full, it is
> > likely that the unused entries will be spread across different
> > locations. It would be nice to be able to read and write a large
> > folio using discontiguous disk IO locations.
> >
> > Some possible ideas for the fragmentation issue:
> >
> > a) Buddy allocator for swap entries. Similar to the buddy allocator
> > for memory, we can use a buddy allocator system for swap entries to
> > avoid low-order swap entries fragmenting too much of the high-order
> > swap entry space. It should greatly reduce the fragmentation caused
> > by allocating and freeing swap entries of different sizes. However,
> > the buddy allocator has its own limits as well. Unlike system
> > memory, which we can move and compact, there is no rmap for swap
> > entries, so it is much harder to move a swap entry to another disk
> > location. So a buddy allocator for swap will help, but it will not
> > solve all the fragmentation issues.
> I have an idea here 😁
> 
> Each swap device is divided into multiple chunks, and each chunk
> serves allocations of a single order (the order of the folio being
> swapped out). This can solve the fragmentation problem, is much
> simpler than a buddy allocator and easier to implement, and can
> handle multiple sizes, similar to a small slab allocator. (A rough
> sketch of this scheme appears below, after this mail.)
> 
> 1) Add structure members
> In the swap_info_struct structure, we only need to add an offset
> array recording the search offset for each order, e.g.:
> 
> #define MTHP_NR_ORDER 9
> 
> struct swap_info_struct {
> 	...
> 	/* per-order allocation offset into this device's chunks */
> 	long order_off[MTHP_NR_ORDER];
> 	...
> };
> 
> Note: order_off = -1 indicates that this order is not supported.
> 
> 2) Initialize
> Set the proportion of the swap device occupied by each order. For the
> sake of simplicity, assume there are 8 kinds of orders. Number of
> slots occupied by each order: chunk_size = 1/8 * maxpages
> (maxpages is the maximum number of available slots in the current
> swap device).

Well, but then if you fill up the space of a particular order and need to
swap out a page of that order, what do you do? Return ENOSPC prematurely?

Frankly, as I'm reading the discussions here, it seems to me you are
trying to reinvent a lot of things from the filesystem space :) Like block
allocation with reasonably efficient fragmentation prevention, transparent
data compression (zswap), hierarchical storage management (i.e., moving
data between different backing stores), and an efficient way to get from
VMA+offset to the place on disk where the content is stored. Sure, you
still don't need a lot of things modern filesystems do, like permissions,
directory structure (or even more complex namespacing stuff), all the
stuff achieving fs consistency after a crash, etc. But still, what you
need is a notable portion of what filesystems do. So maybe it would be
time to implement swap as a proper filesystem? Or even better, we could
think about factoring these bits out of some existing filesystem to share
the code?

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
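To make the per-order chunk scheme quoted above a bit more concrete, here
is a rough, purely hypothetical sketch of the initialization and
allocation it describes (an equal split of the device between orders, with
order_off == -1 meaning the order is not supported). None of the names
below exist in the kernel; the struct, fields and functions are made up
for illustration, and the sketch deliberately ignores locking and the free
path, and fails once a chunk is exhausted, which is exactly the premature
ENOSPC concern raised in the reply above.

	/* Hypothetical illustration only -- not existing kernel code. */
	#define MTHP_NR_ORDER	9

	struct swap_order_chunks {
		unsigned long	maxpages;			/* usable slots on this device */
		long		order_off[MTHP_NR_ORDER];	/* next free slot per order, -1 = unsupported */
		long		order_end[MTHP_NR_ORDER];	/* end of that order's chunk (exclusive) */
	};

	/* Carve the device into equal chunks, one per supported order. */
	static void order_chunks_init(struct swap_order_chunks *c, int nr_orders)
	{
		unsigned long chunk_size = c->maxpages / nr_orders;
		int o;

		for (o = 0; o < MTHP_NR_ORDER; o++) {
			if (o < nr_orders) {
				c->order_off[o] = o * chunk_size;
				c->order_end[o] = (o + 1) * chunk_size;
			} else {
				c->order_off[o] = -1;	/* this order is not supported */
				c->order_end[o] = -1;
			}
		}
	}

	/* Hand out 2^order contiguous slots from that order's chunk, or fail. */
	static long order_chunk_alloc(struct swap_order_chunks *c, int order)
	{
		long off = c->order_off[order];
		long nr = 1L << order;

		if (off < 0 || off + nr > c->order_end[order])
			return -1;	/* chunk exhausted: the ENOSPC question above */
		c->order_off[order] = off + nr;
		return off;		/* first slot of the 2^order run */
	}

With nr_orders = 8, as in the proposal, each order gets 1/8 of maxpages;
the open question is what order_chunk_alloc() should do once its own chunk
is full while other chunks still have free slots.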