Jan Kara <jack@xxxxxxx> wrote on Thu, Mar 14, 2024 at 16:28:
>
> On Fri 08-03-24 10:02:20, Chuanhua Han wrote:
> >
> > On 2024/3/7 22:03, Jan Kara wrote:
> > > On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> > >> On 2024/3/1 17:24, Chris Li wrote:
> > >>> In last year's LSF/MM I talked about a VFS-like swap system. That is
> > >>> the pony that was chosen.
> > >>> However, I did not have much chance to go into details.
> > >>>
> > >>> This year, I would like to discuss what it takes to re-architect the
> > >>> whole swap back end from scratch.
> > >>>
> > >>> Let’s start from the requirements for the swap back end.
> > >>>
> > >>> 1) support the existing swap usage (not the implementation).
> > >>>
> > >>> Some other design goals:
> > >>>
> > >>> 2) low per swap entry memory usage.
> > >>>
> > >>> 3) low io latency.
> > >>>
> > >>> What are the functions the swap system needs to support?
> > >>>
> > >>> At the device level, the swap system needs to support a list of swap
> > >>> files with a priority order. Swap devices of the same priority are
> > >>> written to in round-robin fashion. The swap device types include
> > >>> zswap, zram, SSD, spinning hard disk, and a swap file in a file system.
> > >>>
> > >>> At the swap entry level, here is the list of existing swap entry usage:
> > >>>
> > >>> * Swap entry allocation and free. Each swap entry needs to be
> > >>> associated with a location of the disk space in the swapfile (the
> > >>> offset of the swap entry).
> > >>> * Each swap entry needs to track the map count of the entry. (swap_map)
> > >>> * Each swap entry needs to be able to find the associated memory
> > >>> cgroup. (swap_cgroup_ctrl->map)
> > >>> * Swap cache. Look up the folio/shadow from the swap entry.
> > >>> * Swap page writes through a swapfile in a file system other than a
> > >>> block device. (swap_extent)
> > >>> * Shadow entry. (stored in the swap cache)
> > >>>
> > >>> Any new swap back end might have a different internal implementation,
> > >>> but needs to support the above usage. For example, using the existing
> > >>> file system as swap backend, per vma or per swap entry map to a file
> > >>> would mean it needs an additional data structure to track the
> > >>> swap_cgroup_ctrl, combined with the size of the file inode. It would
> > >>> be challenging to meet design goals 2) and 3) using another file
> > >>> system as it is.
> > >>>
> > >>> I am considering grouping different swap entry data into one single
> > >>> struct and dynamically allocating it, so there is no upfront
> > >>> allocation of swap_map.
> > >>>
> > >>> For the swap entry allocation: the current kernel supports swapping
> > >>> out order-0 or pmd-order pages.
> > >>>
> > >>> There are some discussions and patches that add swap out for folio
> > >>> sizes in between (mTHP)
> > >>>
> > >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@xxxxxxx/
> > >>>
> > >>> and swap in for mTHP:
> > >>>
> > >>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@xxxxxxxxx/
> > >>>
> > >>> The introduction of swapping different orders of pages will further
> > >>> complicate the swap entry fragmentation issue. The swap back end has
> > >>> no way to predict the life cycle of the swap entries. Repeatedly
> > >>> allocating and freeing swap entries of different sizes will fragment
> > >>> the swap entry array. If we can’t allocate contiguous swap entries
> > >>> for an mTHP, we will have to split the mTHP to a smaller size to
> > >>> perform the swap in and out.
> > >>>
> > >>> Current swap only supports 4K pages or pmd size pages. Adding the
> > >>> other in-between sizes greatly increases the chance of fragmenting
> > >>> the swap entry space. When there are no more contiguous swap entries
> > >>> for an mTHP, the mTHP is forced to split into 4K pages. If we don’t
> > >>> solve the fragmentation issue, it will be a constant source of
> > >>> splitting the mTHP.
> > >>>
> > >>> Another limitation I would like to address is that swap_writepage can
> > >>> only write out IO in one contiguous chunk and is not able to perform
> > >>> discontiguous IO. When the swapfile is close to full, it is likely
> > >>> the unused entries will be spread across different locations. It
> > >>> would be nice to be able to read and write a large folio using
> > >>> discontiguous disk IO locations.
> > >>>
> > >>> Some possible ideas for the fragmentation issue.
> > >>>
> > >>> a) buddy allocator for swap entries. Similar to the buddy allocator
> > >>> in memory, we can use a buddy allocator system for the swap entries
> > >>> to keep low order swap entries from fragmenting too much of the high
> > >>> order swap entry space. It should greatly reduce the fragmentation
> > >>> caused by allocating and freeing swap entries of different sizes.
> > >>> However, the buddy allocator has its own limits as well. With system
> > >>> memory, we can move and compact the memory. There is no rmap for swap
> > >>> entries, so it is much harder to move a swap entry to another disk
> > >>> location. So the buddy allocator for swap will help, but not solve
> > >>> all the fragmentation issues.
> > >> I have an idea here😁
> > >>
> > >> Each swap device is divided into multiple chunks, and each chunk is
> > >> allocated to meet each order allocation
> > >> (order indicates the order of swapout's folio, and each chunk is used
> > >> for only one order).
> > >> This can solve the fragmentation problem, which is much simpler than
> > >> buddy, easier to implement,
> > >> and can be compatible with multiple sizes, similar to a small slab
> > >> allocator.
> > >>
> > >> 1) Add structure members
> > >> In the swap_info_struct structure, we only need to add the offset array
> > >> representing the search offset of each order.
> > >> eg:
> > >>
> > >> #define MTHP_NR_ORDER 9
> > >>
> > >> struct swap_info_struct {
> > >>     ...
> > >>     long order_off[MTHP_NR_ORDER];
> > >>     ...
> > >> };
> > >>
> > >> Note: order_off = -1 indicates that this order is not supported.
> > >>
> > >> 2) Initialize
> > >> Set the proportion of swap device occupied by each order.
> > >> For the sake of simplicity, there are 8 kinds of orders.
> > >> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> > >> (maxpages indicates the maximum number of available slots in the current
> > >> swap device)
> > > Well, but then if you fill in space of a particular order and need to swap
> > > out a page of that order what do you do? Return ENOSPC prematurely?
> > If we swap out a subpage of a large folio (due to a split of the large
> > folio), we simply search for a free swap entry starting from order_off[0].
>
> I meant what are you going to do if you want to swapout 2MB huge page but
> you don't have any free swap entry of the appropriate order? History shows
> that these schemes where you partition available space into buckets of
> pages of different order tend to fragment rather quickly so you need to
> also implement some defragmentation / compaction scheme and once you do
> that you are at the complexity of a standard filesystem block allocator.
> That is all I wanted to point at :)
OK, got it! It's true that my approach doesn't eliminate fragmentation, but
it can mitigate it to some extent, and the method itself doesn't currently
involve complex file system operations.
>
> Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
>

Thanks,
Chuanhua
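
For readers following the thread, the per-order chunk idea and the objection to it can be sketched in plain userspace C. This is only an illustration under assumptions that are not in the mails: `struct swap_chunks`, `order_end`, `NR_CHUNKS`, `swap_chunks_init()` and `swap_chunks_alloc()` are hypothetical names, each order's chunk is treated as a simple bump region, freeing and natural alignment of offsets are not modeled, and order 8 is marked unsupported because the proposal divides the device into 8 chunks while `MTHP_NR_ORDER` is 9.

```c
#include <assert.h>
#include <stddef.h>

#define MTHP_NR_ORDER 9   /* value taken from the mail */
#define NR_CHUNKS     8   /* assumption: "8 kinds of orders", 1/8 of maxpages each */

/* Hypothetical, simplified stand-in for the fields added to swap_info_struct. */
struct swap_chunks {
    long maxpages;                  /* total number of slots on the device */
    long order_off[MTHP_NR_ORDER];  /* next search offset per order, -1 = unsupported */
    long order_end[MTHP_NR_ORDER];  /* one past the last slot of that order's chunk */
};

/* Split the device into NR_CHUNKS equal chunks, one per supported order. */
static void swap_chunks_init(struct swap_chunks *sc, long maxpages)
{
    long chunk_size = maxpages / NR_CHUNKS;

    assert(maxpages > 0);
    sc->maxpages = maxpages;
    for (int order = 0; order < MTHP_NR_ORDER; order++) {
        long nr = 1L << order;      /* slots needed by one entry of this order */

        if (order >= NR_CHUNKS || chunk_size < nr) {
            sc->order_off[order] = -1;   /* order not supported */
            sc->order_end[order] = -1;
            continue;
        }
        sc->order_off[order] = order * chunk_size;
        sc->order_end[order] = (order + 1) * chunk_size;
    }
}

/*
 * Allocate 2^order contiguous slots from the chunk reserved for that order.
 * Returns the first slot offset, or -1 when the order is unsupported or its
 * chunk is exhausted, even if other chunks still have free slots.
 */
static long swap_chunks_alloc(struct swap_chunks *sc, int order)
{
    long nr, off;

    if (order < 0 || order >= MTHP_NR_ORDER || sc->order_off[order] < 0)
        return -1;

    nr = 1L << order;
    off = sc->order_off[order];
    if (off + nr > sc->order_end[order])
        return -1;                  /* this order's chunk is full */

    sc->order_off[order] = off + nr;
    return off;
}
```

With a 1024-slot device, each chunk holds 128 slots, so the order-7 chunk fits exactly one 128-slot entry. A second order-7 request fails while the order-0 chunk is still almost empty, which is the premature ENOSPC case raised above.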