On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author)
>> commented "I believe this should be OK. Better to compare the performance
>> too." at [1]. That implies to me that perhaps the optimization wasn't in
>> response to a specific problem after all. Do you have any thoughts, Huang?
>
> Might make sense to include that in the patch description!
>
>> OK, so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap
>> slot from the swap entry. Today swap supports 2 sizes: PAGE_SIZE and
>> PMD_SIZE. For PMD_SIZE, it always uses a full cluster, so we can easily
>> add a flag to the cluster to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all (power-of-2) sizes. Most of the
>> time, a cluster will contain only one size of THP, but this is not the
>> case when a THP in the swapcache gets split or when an order-0 slot gets
>> stolen. We expect these cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of
>> the time it will match the actual size of the swap entry, but sometimes it
>> will cover only a portion of it. In the latter case you may see a false
>> negative from swap_page_trans_huge_swapped(), meaning we take the slow
>> path, but that is rare. There is one wrinkle: currently the HUGE flag is
>> cleared in put_swap_folio(). We wouldn't want to do the equivalent in the
>> new scheme (i.e. set the whole cluster to order-0). I think that is safe,
>> but I haven't completely convinced myself yet.
>>
>> 2) Allocate 4 bits per (small) swap slot to hold the order. This gives
>> precise information and is conceptually simpler to understand, but costs
>> more memory (half as much again as the initial swap_map[]).
>>
>> I would still prefer to avoid all of this if we can (and would like to
>> hear Huang's thoughts). But if it's a choice between 1 and 2, I prefer 1 -
>> I'll do some prototyping.
>
> Taking a step back: what if we simply batch the unmapping of swap entries?
>
> That is, if we're unmapping a PTE range, we'll collect swap entries (under
> PT lock) that reference consecutive swap offsets in the same swap file.

Yes, in principle, but there are 4 places where free_swap_and_cache() is
called, and only 2 of those are really amenable to batching
(zap_pte_range() and madvise_free_pte_range()). So the other two users will
still take the "slow" path. Maybe those 2 call sites are the only ones that
really matter? I can certainly have a stab at this approach.

>
> There, we can then first decrement all the swap counts, and then try
> minimizing how often we actually have to try reclaiming swap space (lookup
> folio, see it's a large folio that we cannot reclaim or could reclaim,
> ...).
>
> Might need some fine-tuning in swap code to "advance" to the next entry to
> try freeing up, but we certainly can do better than what we would do right
> now.

I'm not sure I've understood this. Isn't advancing just a matter of:

entry = swp_entry(swp_type(entry), swp_offset(entry) + 1);
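
FWIW, so that we're talking about the same thing, here is the rough shape I
have in mind for the batched free path. The helper name
free_swap_and_cache_nr() is made up, and this is only a sketch; it glosses
over the non_swap_entry() check, the huge-swapped check and the cluster
locking that the real thing would need:

/*
 * Hypothetical batched variant of free_swap_and_cache(): a sketch only.
 * The caller has already collected, under the PT lock, nr swap entries
 * with consecutive offsets in the same swap file, starting at @entry.
 */
static void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
	unsigned long offset = swp_offset(entry);
	struct swap_info_struct *si;
	int i;

	si = get_swap_device(entry);
	if (!si)
		return;

	/* Pass 1: drop the swap count on every entry up front. */
	for (i = 0; i < nr; i++)
		__swap_entry_free(si, swp_entry(swp_type(entry), offset + i));

	/*
	 * Pass 2: with all the counts dropped, try to reclaim the swap
	 * cache. Only slots still holding SWAP_HAS_CACHE are worth
	 * attempting; this is also where any "advance" smarts would go
	 * (e.g. skipping the rest of a large folio we have already
	 * decided we can't reclaim).
	 */
	for (i = 0; i < nr; i++) {
		if (READ_ONCE(si->swap_map[offset + i]) == SWAP_HAS_CACHE)
			__try_to_reclaim_swap(si, offset + i,
					      TTRS_UNMAPPED | TTRS_FULL);
	}

	put_swap_device(si);
}

With that shape, zap_pte_range() would just count how many consecutive swap
PTEs it sees and make one call, rather than calling free_swap_and_cache()
once per PTE.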
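
And going back to idea 1 above, the consumer side would be something like
the below. The order field in swap_cluster_info and the helper name are
invented; lock_cluster() already returns NULL when there is no cluster info
(rotational devices), in which case only order-0 slots exist anyway:

/*
 * Sketch for idea 1: the cluster header records the smallest order of
 * any swap entry currently in the cluster (the "order" field is
 * hypothetical).
 */
static int swap_entry_order(struct swap_info_struct *si, swp_entry_t entry)
{
	struct swap_cluster_info *ci;
	int order = 0;

	ci = lock_cluster(si, swp_offset(entry));
	if (ci)
		order = ci->order;
	unlock_cluster(ci);

	return order;
}

When the header under-reports an entry's real size we would scan too few
slots, which is the false-negative / slow-path case I described above.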