Chris Li <chrisl@xxxxxxxxxx> wrote on Thu, May 16, 2024 at 07:07:
>
> Hi,
>
> Here is my slide for today's swap abstraction discussion.
>
> https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
Great, thank you!
>
> Chris
>
> On Thu, Mar 14, 2024 at 4:20 AM Chuanhua Han <chuanhuahan@xxxxxxxxx> wrote:
> >
> > Jan Kara <jack@xxxxxxx> wrote on Thu, Mar 14, 2024 at 16:28:
> > >
> > > On Fri 08-03-24 10:02:20, Chuanhua Han wrote:
> > > >
> > > > On 2024/3/7 22:03, Jan Kara wrote:
> > > > > On Thu 07-03-24 15:56:57, Chuanhua Han via Lsf-pc wrote:
> > > > >> On 2024/3/1 17:24, Chris Li wrote:
> > > > >>> In last year's LSF/MM I talked about a VFS-like swap system. That is
> > > > >>> the pony that was chosen.
> > > > >>> However, I did not have much chance to go into details.
> > > > >>>
> > > > >>> This year, I would like to discuss what it takes to re-architect the
> > > > >>> whole swap back end from scratch.
> > > > >>>
> > > > >>> Let’s start from the requirements for the swap back end.
> > > > >>>
> > > > >>> 1) support the existing swap usage (not the implementation).
> > > > >>>
> > > > >>> Some other design goals:
> > > > >>>
> > > > >>> 2) low per swap entry memory usage.
> > > > >>>
> > > > >>> 3) low io latency.
> > > > >>>
> > > > >>> What are the functions the swap system needs to support?
> > > > >>>
> > > > >>> At the device level, the swap system needs to support a list of swap
> > > > >>> files with a priority order. Swap devices of the same priority are
> > > > >>> written round robin. Swap device types include zswap, zram, SSD,
> > > > >>> spinning hard disk, and a swap file in a file system.
> > > > >>>
> > > > >>> At the swap entry level, here is the list of existing swap entry usage:
> > > > >>>
> > > > >>> * Swap entry allocation and free. Each swap entry needs to be
> > > > >>> associated with a location of the disk space in the swapfile. (offset
> > > > >>> of swap entry).
> > > > >>> * Each swap entry needs to track the map count of the entry.
> > > > >>> (swap_map)
> > > > >>> * Each swap entry needs to be able to find the associated memory
> > > > >>> cgroup. (swap_cgroup_ctrl->map)
> > > > >>> * Swap cache. Lookup folio/shadow from swap entry
> > > > >>> * Swap page writes through a swapfile in a file system other than a
> > > > >>> block device. (swap_extent)
> > > > >>> * Shadow entry. (stored in the swap cache)
> > > > >>>
> > > > >>> Any new swap back end might have a different internal implementation,
> > > > >>> but needs to support the above usage. For example, using an existing
> > > > >>> file system as the swap back end, with a file per vma or per swap
> > > > >>> entry, would mean it needs an additional data structure to track the
> > > > >>> swap_cgroup_ctrl, combined with the size of the file inode. It would
> > > > >>> be challenging to meet design goals 2) and 3) using another file
> > > > >>> system as it is.
> > > > >>>
> > > > >>> I am considering grouping different swap entry data into one single
> > > > >>> struct and dynamically allocating it, so there is no upfront
> > > > >>> allocation of swap_map.
> > > > >>>
> > > > >>> For the swap entry allocation: the current kernel supports swapping
> > > > >>> out order 0 or pmd order pages.
> > > > >>>
> > > > >>> There are some discussions and patches that add swap out for folio
> > > > >>> sizes in between (mTHP)
> > > > >>>
> > > > >>> https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@xxxxxxx/
> > > > >>>
> > > > >>> and swap in for mTHP:
> > > > >>>
> > > > >>> https://lore.kernel.org/all/20240229003753.134193-1-21cnbao@xxxxxxxxx/
> > > > >>>
> > > > >>> The introduction of swapping different orders of pages will further
> > > > >>> complicate the swap entry fragmentation issue. The swap back end has
> > > > >>> no way to predict the life cycle of the swap entries. Repeatedly
> > > > >>> allocating and freeing swap entries of different sizes will fragment
> > > > >>> the swap entry array.
> > > > >>> If we can’t allocate contiguous swap entries for an mTHP, it
> > > > >>> will have to split the mTHP to a smaller size to perform the swap in
> > > > >>> and out.
> > > > >>>
> > > > >>> Current swap only supports 4K pages or pmd size pages. Adding the
> > > > >>> other in-between sizes greatly increases the chance of fragmenting
> > > > >>> the swap entry space. When there are no more contiguous swap entries
> > > > >>> for an mTHP, the mTHP is forced to split into 4K pages. If we don’t
> > > > >>> solve the fragmentation issue, it will be a constant source of mTHP
> > > > >>> splitting.
> > > > >>>
> > > > >>> Another limitation I would like to address is that swap_writepage can
> > > > >>> only write out IO in one contiguous chunk; it is not able to perform
> > > > >>> non-contiguous IO. When the swapfile is close to full, it is likely
> > > > >>> the unused entries will be spread across different locations. It would
> > > > >>> be nice to be able to read and write a large folio using discontiguous
> > > > >>> disk IO locations.
> > > > >>>
> > > > >>> Some possible ideas for the fragmentation issue.
> > > > >>>
> > > > >>> a) buddy allocator for swap entries. Similar to the buddy allocator
> > > > >>> in memory, we can use a buddy allocator system for swap entries to
> > > > >>> keep low order swap entries from fragmenting too much of the high
> > > > >>> order swap entry space. It should greatly reduce the fragmentation
> > > > >>> caused by allocating and freeing swap entries of different sizes.
> > > > >>> However, the buddy allocator has its own limits as well. Unlike
> > > > >>> system memory, where we can move and compact pages, there is no rmap
> > > > >>> for swap entries, so it is much harder to move a swap entry to
> > > > >>> another disk location. So the buddy allocator for swap will help,
> > > > >>> but not solve all the fragmentation issues.
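A minimal sketch of the split-on-fragmentation behaviour described above, assuming a byte-per-slot map standing in for swap_map (0 = free). The names find_free_run and alloc_with_split are hypothetical and are not kernel functions; the real allocator is considerably more involved.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a swap map: map[i] == 0 means slot i is free.
 * Return the first naturally aligned run of (1 << order) free slots, or -1.
 */
long find_free_run(const unsigned char *map, size_t nslots, unsigned int order)
{
    size_t run = (size_t)1 << order;

    for (size_t base = 0; base + run <= nslots; base += run) {
        size_t i = 0;

        while (i < run && map[base + i] == 0)
            i++;
        if (i == run)
            return (long)base;
    }
    return -1;
}

/*
 * Try the requested order first; if no contiguous run exists, fall back to
 * smaller orders. A result with *got_order < order corresponds to having
 * to split the large folio before swap out.
 */
long alloc_with_split(unsigned char *map, size_t nslots,
                      unsigned int order, unsigned int *got_order)
{
    for (;;) {
        long off = find_free_run(map, nslots, order);

        if (off >= 0) {
            size_t run = (size_t)1 << order;

            for (size_t i = 0; i < run; i++)
                map[(size_t)off + i] = 1;   /* mark the run as in use */
            *got_order = order;
            return off;
        }
        if (order == 0)
            return -1;                      /* device is completely full */
        order--;
    }
}
```

With 16 slots and only slots 1, 5, 9 and 13 in use, 12 slots are free, yet an order-2 request finds no aligned run of four and falls back to order 1.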
> > > > >> I have an idea here😁
> > > > >>
> > > > >> Each swap device is divided into multiple chunks, and each chunk is
> > > > >> allocated to meet each order allocation
> > > > >> (order indicates the order of swapout's folio, and each chunk is used
> > > > >> for only one order).
> > > > >> This can solve the fragmentation problem, which is much simpler than
> > > > >> buddy, easier to implement,
> > > > >> and can be compatible with multiple sizes, similar to small slab allocator.
> > > > >>
> > > > >> 1) Add structure members
> > > > >> In the swap_info_struct structure, we only need to add the offset array
> > > > >> representing the offset of each order search.
> > > > >> eg:
> > > > >>
> > > > >> #define MTHP_NR_ORDER 9
> > > > >>
> > > > >> struct swap_info_struct {
> > > > >>     ...
> > > > >>     long order_off[MTHP_NR_ORDER];
> > > > >>     ...
> > > > >> };
> > > > >>
> > > > >> Note: order_off = -1 indicates that this order is not supported.
> > > > >>
> > > > >> 2) Initialize
> > > > >> Set the proportion of swap device occupied by each order.
> > > > >> For the sake of simplicity, there are 8 kinds of orders.
> > > > >> Number of slots occupied by each order: chunk_size = 1/8 * maxpages
> > > > >> (maxpages indicates the maximum number of available slots in the current
> > > > >> swap device)
> > > > > Well, but then if you fill in space of a particular order and need to swap
> > > > > out a page of that order what do you do? Return ENOSPC prematurely?
> > > > If we swapout a subpage of large folio (due to a split in large folio),
> > > > Simply search for a free swap entry from order_off[0].
> > >
> > > I meant what are you going to do if you want to swapout 2MB huge page but
> > > you don't have any free swap entry of the appropriate order?
> > > History shows
> > > that these schemes where you partition available space into buckets of
> > > pages of different order tend to fragment rather quickly so you need to
> > > also implement some defragmentation / compaction scheme and once you do
> > > that you are at the complexity of a standard filesystem block allocator.
> > > That is all I wanted to point at :)
> > OK, got it! It's true that my approach doesn't eliminate
> > fragmentation, but fragmentation can be
> > mitigated to some extent, and the method itself doesn't currently
> > involve complex
> > file system operations.
> > >
> > > Honza
> > > --
> > > Jan Kara <jack@xxxxxxxx>
> > > SUSE Labs, CR
> > >
> > Thanks,
> > Chuanhua

--
Thanks,
Chuanhua
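
A minimal sketch of the per-order chunk scheme discussed in this thread, and of the premature-ENOSPC case raised in reply. It is an illustration under stated assumptions, not the proposed patch: struct swap_dev, NR_ORDERS, MAX_SLOTS, swap_dev_init, swap_alloc and swap_free are hypothetical names, the device is split into equal chunks, and order_off[] is read here as the start offset of each order's chunk.

```c
#include <stddef.h>

#define NR_ORDERS 4    /* hypothetical; the thread's example uses MTHP_NR_ORDER */
#define MAX_SLOTS 64   /* hypothetical device size for this sketch */

struct swap_dev {
    unsigned char used[MAX_SLOTS];  /* 0 = free slot, 1 = in use */
    long chunk_size;                /* slots reserved for each order */
    long order_off[NR_ORDERS];      /* start of each order's chunk; -1 = unsupported */
};

/* Give every order an equal share of the device (chunk_size = maxpages / N). */
void swap_dev_init(struct swap_dev *d)
{
    d->chunk_size = MAX_SLOTS / NR_ORDERS;
    for (long i = 0; i < MAX_SLOTS; i++)
        d->used[i] = 0;
    for (int order = 0; order < NR_ORDERS; order++) {
        /* An order whose entry does not fit in one chunk is unsupported. */
        if ((1L << order) > d->chunk_size)
            d->order_off[order] = -1;
        else
            d->order_off[order] = order * d->chunk_size;
    }
}

/*
 * Allocate one entry of the given order, searching only that order's chunk.
 * Returns the first slot offset, or -1 when the chunk is full, even if
 * other chunks still have free slots.
 */
long swap_alloc(struct swap_dev *d, int order)
{
    if (order < 0 || order >= NR_ORDERS || d->order_off[order] < 0)
        return -1;

    long run = 1L << order;
    long start = d->order_off[order];
    long end = start + d->chunk_size;

    /* All entries in a chunk have the same size, so checking the first
     * slot of each run is enough. */
    for (long base = start; base + run <= end; base += run) {
        if (!d->used[base]) {
            for (long i = 0; i < run; i++)
                d->used[base + i] = 1;
            return base;
        }
    }
    return -1;
}

/* Release an entry previously returned by swap_alloc() for this order. */
void swap_free(struct swap_dev *d, long base, int order)
{
    long run = 1L << order;

    for (long i = 0; i < run; i++)
        d->used[base + i] = 0;
}
```

In this sketch the order-3 chunk holds only two entries, so a third order-3 request fails while 48 slots in the other chunks are still free. Fixing that means borrowing from or compacting other chunks, which is where the added complexity comes in.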