On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> On 29.07.22 03:40, Peter Xu wrote:
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> > archs when there's a formal version]
> >
> > Problem
> > =======
> >
> > When migrating a page, right now we always mark the migrated page as
> > old.  The reason could be that we don't really know whether the page
> > is hot or cold, so we default to the negative, assuming that's safer.
> >
> > However, that can lead to at least two problems:
> >
> > (1) We lose the real hot/cold information that we could have
> >     persisted.  That information shouldn't change even if the backing
> >     page is changed after the migration.
> >
> > (2) There is always extra overhead on the immediate next access to
> >     any migrated page, because the hardware MMU needs cycles to set
> >     the young bit again (as long as the MMU supports it).
> >
> > Many of the recent upstream works showed that (2) is not trivial and
> > is actually very measurable.  In my test case, reading a 1G chunk of
> > memory - jumping in page size intervals - could take 99ms just
> > because of the extra setting of the young bit on a generic x86_64
> > system, compared to 4ms when the young bit is already set.
> >
> > This issue was originally reported by Andrea Arcangeli.
> >
> > Solution
> > ========
> >
> > To solve this problem, this patchset tries to remember the young bit
> > in the migration entries and carry it over when recovering the ptes.
> >
> > We have the chance to do so because on many systems the swap offset
> > is not fully used.  Migration entries use the swp offset to store the
> > PFN only, and the PFN is normally smaller than what the swp offset
> > can hold.  It means we do have some free bits in the swp offset that
> > we can use to store things like the young bit, and that's how this
> > series approaches the problem.
> >
> > One tricky thing here is that even though we're embedding the
> > information into the swap entry, which seems to be a very generic
> > data structure, the number of free bits is still arch-dependent.
> > That is not only because the size of swp_entry_t differs, but also
> > due to the different layouts of swap ptes on different archs.
> >
> > Here, this series requires each arch to define an extra macro called
> > __ARCH_SWP_OFFSET_BITS that represents the size of the swp offset.
> > With this information, the swap logic can know whether there are
> > extra bits to use, and if so it'll remember the young bit when
> > possible.  By default, it'll keep the old behavior of keeping all
> > migrated pages cold.
>
> I played with a similar idea when working on pte_swp_exclusive() but
> gave up, because it ended up looking too hacky. Looking at patch #2, I
> get the same feeling again. Kind of hacky.

Could you explain which part you find "hacky"?

I used the swap entry to avoid per-arch operations.  I failed to figure
out a common way to detect the swp offset length myself, so unluckily in
this RFC I still needed one macro per arch.  Ying's suggestion seems
like a good fit to me to remove that last arch-specific dependency.

> If we mostly only care about x86_64, and it's a performance improvement
> after all, why not simply do it like
> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?

Page migration works for most archs; I want to have it work for all
archs that can easily benefit from it.
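To make the generic approach concrete, here is a minimal sketch of the
idea (the helpers and the SWP_MIG_* macro names below are illustrative
only, not the actual patch code):

/*
 * A migration entry only needs enough offset bits to hold a PFN, i.e.
 * (MAX_PHYSMEM_BITS - PAGE_SHIFT).  Anything the arch provides above
 * that, advertised via __ARCH_SWP_OFFSET_BITS, is free for extra flags
 * such as the young bit.
 */
#define SWP_MIG_PFN_BITS	(MAX_PHYSMEM_BITS - PAGE_SHIFT)
#define SWP_MIG_YOUNG_BIT	(1UL << SWP_MIG_PFN_BITS)

#ifdef __ARCH_SWP_OFFSET_BITS
static inline bool migration_entry_supports_young(void)
{
	/* True only when there's at least one spare bit above the PFN */
	return __ARCH_SWP_OFFSET_BITS > SWP_MIG_PFN_BITS;
}
#else
static inline bool migration_entry_supports_young(void)
{
	/* Arch didn't define the macro: keep all migrated pages cold */
	return false;
}
#endif

static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
{
	if (migration_entry_supports_young())
		return swp_entry(swp_type(entry),
				 swp_offset(entry) | SWP_MIG_YOUNG_BIT);
	return entry;
}

static inline bool is_migration_entry_young(swp_entry_t entry)
{
	if (migration_entry_supports_young())
		return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
	return false;
}

The point is the fallback: an arch that doesn't define the macro, or
has no spare bit above the PFN, simply keeps today's behavior of
recovering all migrated pages as cold.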
Besides, I actually have a question on the anon exclusive bit in the
swap pte: since we have that anyway, why do we need a specific migration
type for anon exclusive pages?  Can they simply be read migration
entries with the anon exclusive bit set?

Thanks,

-- 
Peter Xu