On Tue, Feb 28, 2023 at 03:22:20PM -0800, Chris Li wrote:
> Hi Matthew,
>
> On Sun, Feb 19, 2023 at 04:31:33AM +0000, Matthew Wilcox wrote:
> >
> > I think an overhaul of the swap code is long overdue. I appreciate
> > you're very much focused on zswap, but there are many other problems.
> > For example, swap does not work on zoned devices. Swap readahead is
> > generally physical (ie optimised for spinning discs) rather than logical
> > (more appropriate for SSDs). Swap's management of free space is crude
> > compared to real filesystems. The way that swap bypasses the filesystem
> > when writing to swap files is awful. I haven't even started to look at
>
> Can you expand a bit on that? I assume you want to see the swap file
> behavior more like a normal file system and reuse more of the readpage()
> and writepage() path.

Actually, no, readpage() and writepage() should be reserved for
page cache. We now have a ->swap_rw(), but it's only implemented by
nfs so far. Instead of constructing its own BIOs, swap should invoke
->swap_rw for every filesystem. I suspect we can do a fairly generic
block_swap_rw() for the vast majority of filesystems.

> > what changes need to be made to swap in order to swap out arbitrary-order
> > folios (instead of PMD-sized + PTE-sized).
>
> When the page fault happens, does the whole folios get swapped in or break
> into smaller pages?

I think the whole folio should be swapped in. See my proposal for
determining the correct size folio to use here:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@xxxxxxxxxxxxxxxxxxxx/

Assuming something like that gets implemented, for a large folio to
be swapped out, we've had a selection of page faults on the folio,
followed by a period of no faults. All of a sudden we have a fault,
so I think we should bring the whole folio back in. The algorithm
I outline in that email would then take care of breaking down the
folio into smaller folios if it turns out they're not used.
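
To make the ->swap_rw point concrete, here is a rough userspace model
of the dispatch, not kernel code. Every toy_* name is invented for
illustration, the hook's arguments are simplified compared to the real
->swap_rw, and toy_block_swap_rw() only stands in for the generic
block_swap_rw() suggested above, which does not exist yet. The point
is just that the swap core never builds the I/O itself; it always goes
through the filesystem's hook.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

enum toy_dir { TOY_READ, TOY_WRITE };

struct toy_swap_file;

/* Toy stand-in for a filesystem's operations table; only the swap hook is modelled. */
struct toy_aops {
	int (*swap_rw)(struct toy_swap_file *sf, enum toy_dir dir,
		       size_t page_index, void *buf);
};

/* Toy swap file: a flat buffer models whatever storage the filesystem manages. */
struct toy_swap_file {
	const struct toy_aops *aops;
	unsigned char *backing;
	size_t nr_pages;
	unsigned long fs_calls;	/* number of I/Os that went through the hook */
};

/*
 * Hypothetical generic helper that many block-based filesystems could share.
 * Here it just copies one page to or from the backing buffer.
 */
int toy_block_swap_rw(struct toy_swap_file *sf, enum toy_dir dir,
		      size_t page_index, void *buf)
{
	unsigned char *slot;

	if (page_index >= sf->nr_pages)
		return -1;

	slot = sf->backing + page_index * TOY_PAGE_SIZE;
	sf->fs_calls++;

	if (dir == TOY_WRITE)
		memcpy(slot, buf, TOY_PAGE_SIZE);
	else
		memcpy(buf, slot, TOY_PAGE_SIZE);
	return 0;
}

/* A filesystem opts in by pointing its hook at the generic helper. */
const struct toy_aops toy_blockfs_aops = {
	.swap_rw = toy_block_swap_rw,
};

/* Swap core: no private I/O path, every request is handed to the filesystem. */
int toy_swap_page_io(struct toy_swap_file *sf, enum toy_dir dir,
		     size_t page_index, void *buf)
{
	assert(sf != NULL && buf != NULL);

	if (!sf->aops || !sf->aops->swap_rw)
		return -1;	/* this filesystem cannot host swap in the model */

	return sf->aops->swap_rw(sf, dir, page_index, buf);
}
```

A caller would set up a toy_swap_file with toy_blockfs_aops and a
backing buffer, then call toy_swap_page_io() for each page. A
filesystem with unusual needs (nfs in the real kernel) would supply
its own hook instead of the generic helper.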