On 02.08.22 00:35, Peter Xu wrote:
> On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
>> On 29.07.22 03:40, Peter Xu wrote:
>>> [Marking as RFC; only x86 is supported for now, plan to add a few
>>> more archs when there's a formal version]
>>>
>>> Problem
>>> =======
>>>
>>> When migrating a page, right now we always mark the migrated page as
>>> old. The reason could be that we don't really know whether the page
>>> is hot or cold, so we default to the negative, assuming that's safer.
>>>
>>> However, that can lead to at least two problems:
>>>
>>> (1) We lose the real hot/cold information that we could have
>>>     persisted. That information shouldn't change even if the backing
>>>     page is changed after the migration.
>>>
>>> (2) There is always extra overhead on the first access to any
>>>     migrated page, because the hardware MMU needs cycles to set the
>>>     young bit again (where the MMU supports it).
>>>
>>> Much of the recent upstream work has shown that (2) is not trivial
>>> and is actually very measurable. In my test case, reading a 1G chunk
>>> of memory - jumping in page-size intervals - could take 99ms just
>>> because of the extra setting of the young bit on a generic x86_64
>>> system, compared to 4ms if the young bit is already set.
>>>
>>> This issue was originally reported by Andrea Arcangeli.
>>>
>>> Solution
>>> ========
>>>
>>> To solve this problem, this patchset tries to remember the young bit
>>> in the migration entries and carry it over when recovering the ptes.
>>>
>>> We have the chance to do so because on many systems the swap offset
>>> is not fully used. Migration entries use the swp offset to store the
>>> PFN only, and the PFN is normally smaller than what the swp offset
>>> can hold. That means we do have some free bits in the swp offset
>>> that we can use to store things like the young bit, and that's how
>>> this series approaches the problem.
>>>
>>> One tricky thing here is that even though we're embedding the
>>> information into the swap entry, which seems to be a very generic
>>> data structure, the number of free bits is still arch-dependent: not
>>> only because the size of swp_entry_t differs, but also due to the
>>> different layouts of swap ptes on different archs.
>>>
>>> Hence, this series requires each arch to define an extra macro
>>> called __ARCH_SWP_OFFSET_BITS that represents the size of the swp
>>> offset. With this information, the swap logic knows whether there
>>> are extra bits to use, and then it remembers the young bit when
>>> possible. By default, it keeps the old behavior of leaving all
>>> migrated pages cold.
>>>
>>
>> I played with a similar idea when working on pte_swp_exclusive() but
>> gave up, because it ended up looking too hacky. Looking at patch #2,
>> I get the same feeling again. Kind of hacky.
>
> Could you explain what the "hacky" part you mentioned is?

SWP_PFN_OFFSET_FREE_BITS :)

It's a PFN offset and we're mangling in random other bits. That's hacky
IMHO.

I played with the idea of converting all code to store bits in addition
to the type + offset, but that requires digging through a lot of arch
code to teach that code about additional flags, so I discarded that
idea when working on the COW fixes.

> I used the swap entry to avoid per-arch operations. I failed to figure
> out a common way to know the swp offset length myself, so unluckily in
> this RFC I still needed one macro per arch. Ying's suggestion seems to
> me a good fit here to remove the last arch-specific dependency.
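For reference, a minimal stand-alone model of what the series does as I
understand it - this is NOT kernel code, and all names, bit positions
and widths below are made up for illustration only:

/*
 * If the arch's swp offset is wider than the widest possible PFN, the
 * first free bit above the PFN can carry "young" across migration.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ARCH_SWP_OFFSET_BITS	50	/* what __ARCH_SWP_OFFSET_BITS would report */
#define MAX_PHYSMEM_BITS	46	/* example value only */
#define SWP_PFN_BITS		(MAX_PHYSMEM_BITS - 12)	/* 4K pages */
#define SWP_MIG_YOUNG_BIT	((uint64_t)1 << SWP_PFN_BITS)

typedef struct { uint64_t val; } swp_entry_t;

/* Encode a migration entry: type in the high bits, PFN in the offset. */
static swp_entry_t make_migration_entry(uint64_t type, uint64_t pfn,
					bool young)
{
	uint64_t offset = pfn;

	/* Remember "young" only if the offset actually has a spare bit. */
	if (young && SWP_PFN_BITS + 1 <= ARCH_SWP_OFFSET_BITS)
		offset |= SWP_MIG_YOUNG_BIT;

	return (swp_entry_t){ (type << ARCH_SWP_OFFSET_BITS) | offset };
}

static bool migration_entry_young(swp_entry_t entry)
{
	return entry.val & SWP_MIG_YOUNG_BIT;
}

int main(void)
{
	swp_entry_t e = make_migration_entry(2, 0x12345, true);

	printf("pfn=0x%llx young=%d\n",
	       (unsigned long long)(e.val & (SWP_MIG_YOUNG_BIT - 1)),
	       migration_entry_young(e));
	return 0;
}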
Instead of mangling this into the PFN offset and letting the arch tell
you which bits of the PFN offset are unused ... rather remove the bits
from the offset and define them manually to have a certain meaning.

That's exactly how pte_swp_mkexclusive()/pte_swp_exclusive() is
supposed to be handled on architectures that want to support it.

I hope that makes it clearer what the hacky part is IMHO :)

>
>>
>> If we mostly only care about x86_64, and it's a performance
>> improvement after all, why not simply do it like
>> pte_swp_mkexclusive()/pte_swp_exclusive() ... and reuse a spare PTE
>> bit?
>
> Page migration works for most archs; I want it to work for all archs
> that can easily benefit from it.

Yet IIUC we only care about x86-64 regarding performance, just like the
way the dirty bit is handled?

> Besides, I actually have a question on the anon exclusive bit in the
> swap pte: since we have that anyway, why do we need a specific
> migration type for anon exclusive pages? Can it simply be read
> migration entries with the anon exclusive bit set?

Not before all archs support pte_swp_mkexclusive()/pte_swp_exclusive().

As pte_swp_mkexclusive()/pte_swp_exclusive() only applies to actual
swap PTEs, you could even reuse that bit for migration entries and get
at least the most relevant 64-bit architectures supported easily. A
rough sketch of what I mean is below.
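Again just a stand-alone model, NOT the real x86 layout: the assumption
is that the arch reserves one spare software-available bit in its swap
PTE format, next to (not inside) the type+offset encoding, the same way
a spare bit is reserved for pte_swp_exclusive() on archs that support
it. The name pte_swp_mkyoung() and the bit position are hypothetical:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t val; } pte_t;

/* One software bit reserved in the swap PTE format (illustrative). */
#define PTE_SWP_YOUNG	((uint64_t)1 << 2)

static pte_t pte_swp_mkyoung(pte_t pte)
{
	pte.val |= PTE_SWP_YOUNG;
	return pte;
}

static bool pte_swp_young(pte_t pte)
{
	return pte.val & PTE_SWP_YOUNG;
}

int main(void)
{
	pte_t pte = { 0 };

	pte = pte_swp_mkyoung(pte);
	printf("young=%d\n", pte_swp_young(pte));
	return 0;
}

The arch then simply guarantees that this bit never overlaps the
type/offset fields of its swap PTE format, and generic code can test it
without knowing anything about the offset width.

-- 
Thanks,

David / dhildenb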