On 01/03/2024 17:00, David Hildenbrand wrote:
> On 01.03.24 17:44, Ryan Roberts wrote:
>> On 01/03/2024 16:31, Matthew Wilcox wrote:
>>> On Fri, Mar 01, 2024 at 04:27:32PM +0000, Ryan Roberts wrote:
>>>> I've implemented the batching as David suggested, and I'm pretty confident it's
>>>> correct. The only problem is that during testing I can't provoke the code to
>>>> take the path. I've been poring through the code but struggling to figure out
>>>> under what situation you would expect the swap entry passed to
>>>> free_swap_and_cache() to still have a cached folio? Does anyone have any idea?
>>>>
>>>> This is the original (unbatched) function, after my change, which caused
>>>> David's concern that we would end up calling __try_to_reclaim_swap() far
>>>> too much:
>>>>
>>>> int free_swap_and_cache(swp_entry_t entry)
>>>> {
>>>> 	struct swap_info_struct *p;
>>>> 	unsigned char count;
>>>>
>>>> 	if (non_swap_entry(entry))
>>>> 		return 1;
>>>>
>>>> 	p = _swap_info_get(entry);
>>>> 	if (p) {
>>>> 		count = __swap_entry_free(p, entry);
>>>> 		if (count == SWAP_HAS_CACHE)
>>>> 			__try_to_reclaim_swap(p, swp_offset(entry),
>>>> 					      TTRS_UNMAPPED | TTRS_FULL);
>>>> 	}
>>>> 	return p != NULL;
>>>> }
>>>>
>>>> The trouble is, whenever it's called, count is always 0, so
>>>> __try_to_reclaim_swap() never gets called.
>>>>
>>>> My test case is allocating 1G anon memory, then doing madvise(MADV_PAGEOUT)
>>>> over it. Then doing either a munmap() or madvise(MADV_FREE), both of which
>>>> cause this function to be called for every PTE, but count is always 0 after
>>>> __swap_entry_free() so __try_to_reclaim_swap() is never called. I've tried for
>>>> order-0 as well as PTE- and PMD-mapped 2M THP.
>>>
>>> I think you have to page it back in again, then it will have an entry in
>>> the swap cache. Maybe. I know little about anon memory ;-)
>>
>> Ahh, I was under the impression that the original folio is put into the swap
>> cache at swap out, then (I guess) it's removed once the IO is complete? I'm sure
>> I'm miles out... what exactly is the lifecycle of a folio going through swap out?
>
> I thought with most (disk) backends you will add it to the swapcache and leave
> it there until there is actual memory pressure. Only then, under memory
> pressure, you'd actually reclaim the folio.

OK, my problem is that I'm using a VM, whose disk shows up as rotating media, so
the swap subsystem refuses to swap out THPs to that and they get split. To solve
that (and to speed up testing), I moved to the block ram disk, which convinces
swap to swap out THPs. But that causes the folios to be removed from the swap
cache (I assumed because it's synchronous, but maybe there is a flag somewhere
to affect that behavior?).

And I can't convince QEMU to emulate an SSD to the guest under macOS. Perhaps
the easiest thing is to hack it to ignore the rotating media flag.

>
> You can fault it back in from the swapcache without having to go to disk.
>
> That's how you can today end up with a THP in the swapcache: during swapin from
> disk (after the folio was reclaimed) you'd currently only get order-0 folios.
>
> At least that was my assumption with my MADV_PAGEOUT testing so far :)
>
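
For reference, the userspace side of the test described above could look
roughly like the sketch below. It is only an illustration under assumptions:
pageout_cycle() is a hypothetical helper name, it assumes Linux 5.4+ with swap
configured, and the optional read-back step follows the "page it back in
again" suggestion, which may or may not leave the folio in the swap cache
depending on the backend.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* asm-generic uapi value, available since Linux 5.4 */
#endif

/*
 * Hypothetical helper (not from the thread): map len bytes of anonymous
 * memory, populate it, ask the kernel to page it out, optionally read it
 * back in, then unmap it so the swap entries get freed.
 *
 * Returns 0 on success, -1 if mmap() or munmap() fails. A failing
 * madvise() is ignored here, since MADV_PAGEOUT is only a hint and is
 * not available on older kernels.
 */
int pageout_cycle(size_t len, int fault_back_in)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	volatile unsigned char *p;
	unsigned char *buf;
	size_t off;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Write to every byte so each page is really allocated. */
	memset(buf, 0xaa, len);

	/* Hint that the whole range should be reclaimed (swapped out). */
	(void)madvise(buf, len, MADV_PAGEOUT);

	if (fault_back_in) {
		/* Read one byte per page to fault the range back in. */
		p = buf;
		for (off = 0; off < len; off += page)
			(void)p[off];
	}

	/* Unmapping is what ends up freeing the swap entries. */
	return munmap(buf, len);
}
```

Calling pageout_cycle(1UL << 30, 0) would correspond to the 1G munmap() case,
and passing 1 as the second argument adds the read-back step. Replacing the
final munmap() with madvise(MADV_FREE) would be needed to cover the other
variant.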