Peter Xu <peterx@xxxxxxxxxx> writes:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> Problem
> =======
>
> When migrating a page, right now we always mark the migrated page as
> old.  The reason could be that we don't really know whether the page is
> hot or cold, so we defaulted to the negative (cold) on the assumption
> that it's safer.
>
> However, that can lead to at least two problems:
>
> (1) We lose the real hot/cold information that we could have preserved.
>     That information shouldn't change even if the backing page is
>     changed after the migration.
>
> (2) There is always extra overhead on the first access to any migrated
>     page, because the hardware MMU needs extra cycles to set the young
>     bit again (as long as the MMU supports it).
>
> Much of the recent upstream work has shown that (2) is not trivial and
> is in fact very measurable.  In my test case, reading a 1G chunk of
> memory - jumping in page-size intervals - could take 99ms just because
> of the extra setting of the young bit, on a generic x86_64 system,
> compared to 4ms if the young bit is already set.

LKP has observed that before too, as in the following report and
discussion:

https://lore.kernel.org/all/87bn35zcko.fsf@xxxxxxxxxxxxxxxxxxxx/t/

Best Regards,
Huang, Ying

> This issue was originally reported by Andrea Arcangeli.
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young bit in
> the migration entries and carry it over when recovering the ptes.
>
> We have the chance to do so because on many systems the swap offset is
> not really fully used.  Migration entries use the swp offset to store
> the PFN only, and the PFN is normally smaller than the maximum swp
> offset.  It means we do have some free bits in the swp offset that we
> can use to store things like the young bit, and that's how this series
> approaches the problem.
>
> One tricky thing here is that even though we're embedding the
> information into the swap entry, which seems to be a very generic data
> structure, the number of free bits is still arch dependent.  Not only
> because the size of swp_entry_t differs, but also due to the different
> layouts of swap ptes on different archs.
>
> Here, this series requires each specific arch to define an extra macro
> called __ARCH_SWP_OFFSET_BITS that represents the size of the swp
> offset.  With this information, the swap logic can know whether there
> are extra bits to use, and then it'll remember the young bit when
> possible.  By default, it'll keep the old behavior of keeping all
> migrated pages cold.
>
> Tests
> =====
>
> After the patchset is applied, the immediate read access test [1] of
> the above 1G chunk after migration shrinks from 99ms to 4ms.  The test
> is done by moving 1G of pages from node 0->1->0 and then reading it in
> page-size jumps.
>
> Currently __ARCH_SWP_OFFSET_BITS is only defined on x86 for this
> series, and it has only been tested on x86_64 with an Intel(R) Xeon(R)
> CPU E5-2630 v4 @ 2.20GHz.
>
> Patch Layout
> ============
>
> Patch 1: Add swp_offset_pfn() and apply it to all pfn swap entries; we
>          should also stop treating swp_offset() as a PFN, because it
>          can contain more information starting from the next patch.
> Patch 2: The core patch to remember the young bit in swap offsets.
> Patch 3: A cleanup for the x86 32-bit pgtable.h.
> Patch 4: Define __ARCH_SWP_OFFSET_BITS on x86, enabling the young bit
>          for migration entries.
>
> Please review, thanks.
>
> [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
>
> Peter Xu (4):
>   mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
>   mm: Remember young bit for page migrations
>   mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
>   mm/x86: Define __ARCH_SWP_OFFSET_BITS
>
>  arch/arm64/mm/hugetlbpage.c           |  2 +-
>  arch/x86/include/asm/pgtable-2level.h |  6 ++
>  arch/x86/include/asm/pgtable-3level.h | 15 +++--
>  arch/x86/include/asm/pgtable_64.h     |  5 ++
>  include/linux/swapops.h               | 85 +++++++++++++++++++++++++--
>  mm/hmm.c                              |  2 +-
>  mm/huge_memory.c                      | 10 +++-
>  mm/memory-failure.c                   |  2 +-
>  mm/migrate.c                          |  4 +-
>  mm/migrate_device.c                   |  2 +
>  mm/page_vma_mapped.c                  |  6 +-
>  mm/rmap.c                             |  3 +-
>  12 files changed, 122 insertions(+), 20 deletions(-)