On 28/02/2024 12:12, David Hildenbrand wrote:
>>> How relevant is it? Relevant enough that someone decided to put that
>>> optimization in? I don't know :)
>>
>> I'll have one last go at convincing you: Huang Ying (original author)
>> commented "I believe this should be OK. Better to compare the performance
>> too." at [1]. That implies to me that perhaps the optimization wasn't in
>> response to a specific problem after all. Do you have any thoughts, Huang?
>
> Might make sense to include that in the patch description!
>
>> OK so if we really do need to keep this optimization, here are some ideas:
>>
>> Fundamentally, we would like to be able to figure out the size of the swap
>> slot from the swap entry. Today swap supports 2 sizes; PAGE_SIZE and
>> PMD_SIZE. For PMD_SIZE, it always uses a full cluster, so can easily add a
>> flag to the cluster to mark it as PMD_SIZE.
>>
>> Going forwards, we want to support all sizes (power-of-2). Most of the time,
>> a cluster will contain only one size of THPs, but this is not the case when
>> a THP in the swapcache gets split or when an order-0 slot gets stolen. We
>> expect these cases to be rare.
>>
>> 1) Keep the size of the smallest swap entry in the cluster header. Most of
>> the time it will be the full size of the swap entry, but sometimes it will
>> cover only a portion. In the latter case you may see a false negative for
>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is
>> rare. There is one wrinkle: currently the HUGE flag is cleared in
>> put_swap_folio(). We wouldn't want to do the equivalent in the new scheme
>> (i.e. set the whole cluster to order-0). I think that is safe, but haven't
>> completely convinced myself yet.
>>
>> 2) allocate 4 bits per (small) swap slot to hold the order. This will give
>> precise information and is conceptually simpler to understand, but will cost
>> more memory (half as much as the initial swap_map[] again).
>>
>> I still prefer to avoid this at all if we can (and would like to hear
>> Huang's thoughts). But if its a choice between 1 and 2, I prefer 1 - I'll do
>> some prototyping.
>
> Taking a step back: what about we simply batch unmapping of swap entries?
>
> That is, if we're unmapping a PTE range, we'll collect swap entries (under PT
> lock) that reference consecutive swap offsets in the same swap file.
>
> There, we can then first decrement all the swap counts, and then try
> minimizing how often we actually have to try reclaiming swap space (lookup
> folio, see it's a large folio that we cannot reclaim or could reclaim, ...).
>
> Might need some fine-tuning in swap code to "advance" to the next entry to
> try freeing up, but we certainly can do better than what we would do right
> now.

Hi,

I'm struggling to convince myself that free_swap_and_cache() can't race with
swapoff(). Can anyone explain why this is safe?

I *think* they are both serialized by the PTL, since all callers of
free_swap_and_cache() (except shmem) hold the PTL, and swapoff() calls
try_to_unuse() early on, which takes the PTL as it iterates over every vma in
every mm. It looks like shmem is handled specially by a call to shmem_unuse(),
but I can't see the exact serialization mechanism.

I've implemented a batching function, as David suggested above, but I'm trying
to convince myself that it is safe for it to access si->swap_map[] without a
lock (i.e. that swapoff() can't concurrently free it).
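As an aside, the caller-side half of that batching (spotting, under the PT
lock, a run of swap PTEs whose entries have consecutive offsets in the same
swap file) conceptually looks like the sketch below. The helper name is made
up and the real code will need to care about more details (batching limits,
how far the run is allowed to extend, etc.); it is only meant to illustrate
what "collect swap entries under PT lock" means in practice:

/*
 * Hypothetical sketch, not the real patch: with the PT lock held, count how
 * many PTEs starting at @start_ptep hold swap entries whose offsets follow
 * @first consecutively in the same swap file, up to @max_nr.
 */
static int count_consecutive_swap_ptes(pte_t *start_ptep, int max_nr,
                                       swp_entry_t first)
{
        unsigned long next_off = swp_offset(first) + 1;
        unsigned int type = swp_type(first);
        int nr;

        for (nr = 1; nr < max_nr; nr++, next_off++) {
                pte_t pte = ptep_get(start_ptep + nr);
                swp_entry_t entry;

                /* Only non-none, non-present PTEs can hold swap entries. */
                if (pte_none(pte) || pte_present(pte))
                        break;

                entry = pte_to_swp_entry(pte);
                if (non_swap_entry(entry) || swp_type(entry) != type ||
                    swp_offset(entry) != next_off)
                        break;
        }

        return nr;
}

The count returned by something like that is what ends up being passed as nr
to the new function shown further down.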
Coming back to the locking question: I think free_swap_and_cache() as it
already exists depends on being able to access the si without an explicit
lock, so I'm assuming the same mechanism will protect my new changes. But I
want to be sure I understand the mechanism...

This is the existing free_swap_and_cache(). I think _swap_info_get() would
break if this could race with swapoff(), and __swap_entry_free() looks up the
cluster from an array, which would also be freed by swapoff() if racing:

int free_swap_and_cache(swp_entry_t entry)
{
        struct swap_info_struct *p;
        unsigned char count;

        if (non_swap_entry(entry))
                return 1;

        p = _swap_info_get(entry);
        if (p) {
                count = __swap_entry_free(p, entry);
                if (count == SWAP_HAS_CACHE)
                        __try_to_reclaim_swap(p, swp_offset(entry),
                                              TTRS_UNMAPPED | TTRS_FULL);
        }
        return p != NULL;
}

This is my new function. I want to be sure that it's safe to do the
READ_ONCE(si->swap_map[...]):

void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
        unsigned long end = swp_offset(entry) + nr;
        unsigned type = swp_type(entry);
        struct swap_info_struct *si;
        unsigned long offset;

        if (non_swap_entry(entry))
                return;

        si = _swap_info_get(entry);
        if (!si || end > si->max)
                return;

        /*
         * First free all entries in the range.
         */
        for (offset = swp_offset(entry); offset < end; offset++) {
                VM_WARN_ON(data_race(!si->swap_map[offset]));
                __swap_entry_free(si, swp_entry(type, offset));
        }

        /*
         * Now go back over the range trying to reclaim the swap cache. This
         * is more efficient for large folios because we will only try to
         * reclaim the swap once per folio in the common case. If we do
         * __swap_entry_free() and __try_to_reclaim_swap() in the same loop,
         * the latter will get a reference and lock the folio for every
         * individual page but will only succeed once the swap slot for every
         * subpage is zero.
         */
        for (offset = swp_offset(entry); offset < end; offset += nr) {
                nr = 1;
                if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {  << HERE
                        /*
                         * Folios are always naturally aligned in swap so
                         * advance forward to the next boundary. Zero means no
                         * folio was found for the swap entry, so advance by 1
                         * in this case. Negative value means folio was found
                         * but could not be reclaimed. Here we can still
                         * advance to the next boundary.
                         */
                        nr = __try_to_reclaim_swap(si, offset,
                                                   TTRS_UNMAPPED | TTRS_FULL);
                        if (nr == 0)
                                nr = 1;
                        else if (nr < 0)
                                nr = -nr;
                        nr = ALIGN(offset + 1, nr) - offset;
                }
        }
}

Thanks,
Ryan
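P.S. To make the question concrete: if the PTL argument doesn't hold, I assume
the swap_map[] access marked "<< HERE" above would need to be bracketed by
get_swap_device()/put_swap_device(), which as I understand it takes a
reference that swapoff() waits for before freeing the swap metadata. A rough,
untested sketch of that pattern (the helper below is invented purely for
illustration):

/*
 * Illustrative only: pin the swap device so that a concurrent swapoff()
 * cannot free si->swap_map[] while we inspect it. My assumption is that
 * get_swap_device() returns NULL once swapoff() has started tearing the
 * device down, and that swapoff() waits for references like this one
 * before freeing anything.
 */
static bool swap_entry_cache_only(swp_entry_t entry)
{
        struct swap_info_struct *si;
        bool ret;

        si = get_swap_device(entry);
        if (!si)
                return false;

        /* swap_map[] is stable between get_swap_device()/put_swap_device(). */
        ret = READ_ONCE(si->swap_map[swp_offset(entry)]) == SWAP_HAS_CACHE;

        put_swap_device(si);
        return ret;
}

But if the existing code is already safe without that, I'd rather understand
why than add the extra reference on this path.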