On Tue, Aug 02, 2022 at 02:06:37PM +0200, David Hildenbrand wrote:
> On 02.08.22 00:35, Peter Xu wrote:
> > On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> >> On 29.07.22 03:40, Peter Xu wrote:
> >>> [Marking as RFC; only x86 is supported for now, plan to add a few more
> >>> archs when there's a formal version]
> >>>
> >>> Problem
> >>> =======
> >>>
> >>> When migrating a page, right now we always mark the migrated page as
> >>> old. The reason could be that we don't really know whether the page
> >>> is hot or cold, so we take the negative as a safer default.
> >>>
> >>> However, that can lead to at least two problems:
> >>>
> >>> (1) We lose the real hot/cold information that we could have
> >>>     preserved. That information shouldn't change even if the backing
> >>>     page is changed after the migration.
> >>>
> >>> (2) There is always extra overhead on the immediate next access to
> >>>     any migrated page, because the hardware MMU needs cycles to set
> >>>     the young bit again (as long as the MMU supports it).
> >>>
> >>> A lot of recent upstream work has shown that (2) is not trivial and
> >>> is actually very measurable. In my test case, reading a 1G chunk of
> >>> memory - jumping in page-size intervals - could take 99ms just
> >>> because of the extra work of setting the young bit on a generic
> >>> x86_64 system, compared to 4ms when the young bit was already set.
> >>>
> >>> This issue was originally reported by Andrea Arcangeli.
> >>>
> >>> Solution
> >>> ========
> >>>
> >>> To solve this problem, this patchset tries to remember the young bit
> >>> in the migration entries and carry it over when recovering the ptes.
> >>>
> >>> We have the chance to do so because in many systems the swap offset
> >>> is not fully used. Migration entries use the swp offset to store the
> >>> PFN only, and the PFN is normally smaller than the maximum swp
> >>> offset. It means we do have some free bits in the swp offset that we
> >>> can use to store things like the young bit, and that's how this
> >>> series approaches the problem.
> >>>
> >>> One tricky thing here is that even though we're embedding the
> >>> information into the swap entry, which seems to be a very generic
> >>> data structure, the number of free bits is still arch dependent. Not
> >>> only because the size of swp_entry_t differs, but also due to the
> >>> different layouts of swap ptes on different archs.
> >>>
> >>> Here, this series requires each arch to define an extra macro called
> >>> __ARCH_SWP_OFFSET_BITS that represents the size of the swp offset.
> >>> With this information, the swap logic can know whether there are
> >>> extra bits to use, and then it'll remember the young bit when
> >>> possible. By default, it keeps the old behavior of leaving all
> >>> migrated pages cold.
> >>>
> >>
> >> I played with a similar idea when working on pte_swp_exclusive() but
> >> gave up, because it ended up looking too hacky. Looking at patch #2, I
> >> get the same feeling again. Kind of hacky.
> >
> > Could you explain what the "hacky" part you mentioned is?
> 
> SWP_PFN_OFFSET_FREE_BITS :)
> 
> It's a PFN offset and we're mangling in random other bits. That's hacky
> IMHO.
> 
> I played with the idea of converting all code to store bits in addition
> to the type + offset. But that requires digging through a lot of arch
> code to teach that code about additional flags, so I discarded that idea
> when working on the COW fixes.
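
To keep the discussion concrete, what patch 2 does is roughly the
following (a simplified sketch only: the helper names and the exact bit
layout are illustrative, not the actual patch, and it assumes an arch
that defines __ARCH_SWP_OFFSET_BITS as described in the cover letter):

  /*
   * A migration entry's swp offset holds a PFN.  If the arch's swp
   * offset is wider than a PFN, the first spare bit above the PFN
   * can carry the young bit across migration.
   */
  #define SWP_MIG_PFN_BITS   (MAX_PHYSMEM_BITS - PAGE_SHIFT)
  #define SWP_MIG_YOUNG_BIT  (1UL << SWP_MIG_PFN_BITS)

  static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
  {
          if (__ARCH_SWP_OFFSET_BITS > SWP_MIG_PFN_BITS)
                  return swp_entry(swp_type(entry),
                                   swp_offset(entry) | SWP_MIG_YOUNG_BIT);
          /* No spare bits: keep the old "migrated pages are cold" behavior */
          return entry;
  }

  static inline bool is_migration_entry_young(swp_entry_t entry)
  {
          if (__ARCH_SWP_OFFSET_BITS > SWP_MIG_PFN_BITS)
                  return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
          return false;
  }

With something along those lines, try_to_migrate_one() would set the bit
when the old pte was young, and remove_migration_pte() would
pte_mkyoung() the new pte when the bit is set, while archs without spare
bits silently keep today's behavior.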
Having SWP_PFN_OFFSET_FREE_BITS was the cleanest approach I could think
of before I knew the max_swapfile_size() trick. It only needs the arch
to define this one macro, and swapops.h will take action accordingly.

OTOH, I don't want to use swap pte bits, not only because there aren't
many left on some archs (x86_64 only has bit 4 left, afaict), but also
because this is a page migration issue, so it'd be nicer to me if it
can be solved at the swp entry level, not the pte level.

> > 
> > I used swap entry to avoid per-arch operations. I failed to figure out
> > a common way to know the swp offset length myself, so unluckily in
> > this RFC I still needed one macro per arch. Ying's suggestion seems to
> > be a good fit here to remove the last arch-specific dependency.
> 
> Instead of mangling this into the PFN offset and letting the arch tell
> you which bits of the PFN offset are unused ... rather remove the bits
> from the offset and define them manually to have a certain meaning.
> That's exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed
> to be handled on architectures that want to support it.
> 
> I hope I could make it clearer what the hacky part is IMHO :)
> 
> >
> >> If we mostly only care about x86_64, and it's a performance improvement
> >> after all, why not simply do it like
> >> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
> >
> > Page migration works for most archs; I want to have it work for all
> > archs that can easily benefit from it.
> 
> Yet we only care about x86-64 IIUC regarding performance, just the way
> the dirty bit is handled?

I don't think we only care about x86_64. Shouldn't other archs have the
same issue as long as there's a hardware young bit? Even without it,
it'll affect the page reclaim logic too, and that's also not x86-only.

> >
> > Besides, I actually have a question on the anon exclusive bit in the
> > swap pte: since we have that anyway, why do we need a specific
> > migration type for anon exclusive pages? Can't they simply be read
> > migration entries with the anon exclusive bit set?
> 
> Not before all archs support pte_swp_mkexclusive/pte_swp_exclusive/.
> 
> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> PTEs, you could even reuse that bit for migration entries and get at
> least the most relevant 64bit architectures supported easily.

Yes, but I think having two mechanisms for a single problem can confuse
people. IIUC the swp pte bit is already defined on the major archs
anyway, and since the anon exclusive bit is best-effort (or am I
wrong?..), I won't worry too much about archs outside x86/arm/ppc/s390
losing the anon exclusive bit during migrations, because afaict the
whole swap type of ANON_EXCLUSIVE_READ only services that very
minority.. which seems to be a pity, wasting a swp type on all archs
even when an arch has defined a swp pte bit just for anon exclusive.

Thanks,

-- 
Peter Xu