On 28/02/2024 14:24, Ryan Roberts wrote:
> On 28/02/2024 13:33, Matthew Wilcox wrote:
>> On Wed, Feb 28, 2024 at 09:37:06AM +0000, Ryan Roberts wrote:
>>> Fundamentally, we would like to be able to figure out the size of the swap slot
>>> from the swap entry. Today swap supports 2 sizes: PAGE_SIZE and PMD_SIZE. For
>>> PMD_SIZE, it always uses a full cluster, so we can easily add a flag to the
>>> cluster to mark it as PMD_SIZE.
>>>
>>> Going forwards, we want to support all sizes (power-of-2). Most of the time, a
>>> cluster will contain only one size of THPs, but this is not the case when a THP
>>> in the swapcache gets split or when an order-0 slot gets stolen. We expect these
>>> cases to be rare.
>>>
>>> 1) Keep the size of the smallest swap entry in the cluster header. Most of the
>>> time it will be the full size of the swap entry, but sometimes it will cover
>>> only a portion. In the latter case you may see a false negative for
>>> swap_page_trans_huge_swapped() meaning we take the slow path, but that is rare.
>>> There is one wrinkle: currently the HUGE flag is cleared in put_swap_folio(). We
>>> wouldn't want to do the equivalent in the new scheme (i.e. set the whole cluster
>>> to order-0). I think that is safe, but haven't completely convinced myself yet.
>>>
>>> 2) Allocate 4 bits per (small) swap slot to hold the order. This will give
>>> precise information and is conceptually simpler to understand, but will cost
>>> more memory (half as much as the initial swap_map[] again).
>>>
>>> I still prefer to avoid this at all if we can (and would like to hear Huang's
>>> thoughts). But if it's a choice between 1 and 2, I prefer 1 - I'll do some
>>> prototyping.
>>
>> I can't quite bring myself to look up the encoding of swap entries
>> but as long as we're willing to restrict ourselves to naturally aligning
>> the clusters, there's an encoding (which I believe I invented) that lets
>> us encode arbitrary power-of-two sizes with a single bit.
>>
>> I describe it here:
>> https://kernelnewbies.org/MatthewWilcox/NaturallyAlignedOrder
>>
>> Let me know if it's not clear.
>
> Ahh yes, I'm familiar with this encoding scheme from other settings. Although
> I've previously thought of it as having a bit to indicate whether the scheme is
> enabled or not, and if it is enabled then the encoded PFN is:
>
>   PFNe = PFNd | (1 << (log2(n) - 1))
>
> Where n is the power-of-2 page count.
>
> Same thing, I think.
>
> I think we would have to steal a bit from the offset to make this work, and it
> looks like the size of that is bottlenecked on the arch's swp_entry PTE
> representation. Looks like there is a MIPS config that only has 17 bits for
> offset to begin with, so I doubt we would be able to spare a bit here? Although
> it looks possible that there are some unused low bits that could be used...
>

I think the other problem with this is that it won't tell us which slot in the
"swap slot block" each entry is targeting?
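
To make that last concern concrete, here is a minimal user-space C sketch of
the encoding as written in the formula quoted above (a separate "enabled" bit
plus PFNe = PFNd | (1 << (log2(n) - 1))). It is only an illustration under
stated assumptions: struct enc_pfn and the enc_*() helpers are hypothetical
names, not kernel APIs; PFNd is assumed to be the naturally aligned base of the
block; and for swap the "PFN" would be the swap offset. It does not claim to
match the exact bit layout on the kernelnewbies page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for illustration only; not a kernel type. */
struct enc_pfn {
	bool enabled;	/* separate "scheme enabled" bit; false means one page */
	uint64_t pfn;	/* PFNe when enabled, the plain PFN otherwise */
};

/* Encode the naturally aligned block of 2^order pages that contains pfn. */
static struct enc_pfn enc_encode(uint64_t pfn, unsigned int order)
{
	struct enc_pfn e = { .enabled = order > 0, .pfn = pfn };

	assert(order < 64);
	if (order > 0) {
		/* Natural alignment: the low 'order' bits of the base are zero. */
		uint64_t base = pfn & ~((UINT64_C(1) << order) - 1);

		/* PFNe = PFNd | (1 << (log2(n) - 1)), with n = 2^order */
		e.pfn = base | (UINT64_C(1) << (order - 1));
	}
	return e;
}

/* Recover the order: the marker is the lowest set bit of PFNe. */
static unsigned int enc_order(struct enc_pfn e)
{
	unsigned int order = 1;

	if (!e.enabled)
		return 0;
	assert(e.pfn != 0);	/* the marker bit is always set when enabled */
	while (!(e.pfn & 1)) {
		e.pfn >>= 1;
		order++;
	}
	return order;
}

/* Recover the block base by clearing the marker (lowest set) bit. */
static uint64_t enc_base(struct enc_pfn e)
{
	return e.enabled ? (e.pfn & (e.pfn - 1)) : e.pfn;
}
```

With order 2, both enc_encode(0x1000, 2) and enc_encode(0x1003, 2) produce
PFNe 0x1002: the order and the block base can be recovered, but the low bits
that would say which slot within the block was meant are overwritten by the
encoding, so the index within the block would have to come from somewhere else.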