Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 20.03.24 01:50, Sean Christopherson wrote:
> > Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> > i.e. when creating mappings for KVM guests, instead of when zapping or
> > modifying SPTEs, e.g. when dropping mappings.
> >
> > The motivation is twofold:
> >
> >    1. Marking folios dirty and accessed when zapping can be extremely
> >       expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
> >       512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
> >       dirty and accessed for all 512*512 SPTEs.
> >
> >    2. x86 diverges from literally every other architecture, which updates
> >       folios when mappings are created.  AFAIK, x86 is unique in that it's
> >       the only KVM arch that prefetches PTEs, so it's not quite an apples-
> >       to-apples comparison, but I don't see any reason for the dirty logic
> >       in particular to be different.
> >
>
> Sorry in advance for the lengthy reply.
>
>
> On "ordinary" process page tables on x86, it behaves as follows:
>
> 1) A page might be mapped writable but the PTE might not be dirty. Once
>     written to, HW will set the PTE dirty bit.
>
> 2) A page might be mapped but the PTE might not be young. Once accessed,
>     HW will set the PTE young bit.
>
> 3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
>     PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
>     the folio (folio_mark_accessed()). The latter is done conditionally
>     only (vma_has_recency()).
>
> BUT, when zapping an anon folio, we don't do that, because in that case
> zapping implies "gone for good" and not "content must go to a file".
>
> 4) When temporarily unmapping a folio for migration/swapout, we
>     primarily only move the dirty PTE bit to the folio.
>
>
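To make sure I'm reading 1)-4) correctly, the zap-time transfer in the
primary MMU boils down to roughly the fragment below (a simplified
sketch; locking, rmap handling and the anon-folio special case are
elided, and the real logic lives around zap_present_folio_ptes() in
mm/memory.c):

	if (pte_dirty(ptent))
		folio_mark_dirty(folio);
	if (pte_young(ptent) && vma_has_recency(vma))
		folio_mark_accessed(folio);
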
> GUP is different, because the PTEs might change after we pinned the page
> and wrote to it. We don't modify the PTEs and expect the GUP user to do
> the right thing (set dirty/accessed). For example,
> unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
> where the PTE might be long gone.
>
> So GUP does not really behave like HW access.
>
>
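For contrast, a typical GUP user looks something like the minimal sketch
below (error handling trimmed, and the DMA step is only a placeholder;
pin_user_pages_fast()/unpin_user_pages_dirty_lock() are the real APIs),
i.e. the dirty marking happens at unpin time with no PTE involved:

	/* addr/nr_pages/pages are provided by the caller. */
	static int example_gup_dma(unsigned long addr, int nr_pages,
				   struct page **pages)
	{
		int npinned = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE,
						  pages);
		if (npinned <= 0)
			return npinned ? npinned : -EFAULT;

		/* ... device DMAs into the pinned pages; PTEs may change ... */

		/* Dirty marking happens here, with the PTEs possibly gone. */
		unpin_user_pages_dirty_lock(pages, npinned, true);
		return 0;
	}
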
> Secondary page tables are different from ordinary GUP, and KVM ends up
> using GUP to some degree to simulate HW access; regarding NUMA hinting,
> KVM has already turned out to be very different from all other GUP
> users. [1]
>
> And I recall that at some point I raised that we might want to have a
> dedicated interface for these "mmu-notifier"-based page table
> synchronization mechanisms.
>
> But KVM ends up setting folio dirty/accessed flags itself, like other
> GUP users. I do wonder if secondary page tables should be messing with
> folio flags *at all*, and if there would be ways to do it differently
> using PTEs.
>
> We make sure to synchronize the secondary page tables to the process
> page tables using MMU notifiers: when we write-protect/unmap a PTE, we
> write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
> differently.

Accessed bits have the test/clear young MMU-notifiers. But I agree
there aren't any notifiers for dirty tracking.
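
For reference, the accessed-bit side already flows through the
clear_young/test_young mmu_notifier hooks, roughly like the sketch below
(secondary_clear_accessed() is a made-up stand-in; the hook signature is
the one from include/linux/mmu_notifier.h as I remember it):

	/* Clear the accessed state in the secondary MMU for [start, end). */
	static int example_clear_young(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
	{
		/* Non-zero means at least one secondary PTE was young. */
		return secondary_clear_accessed(mn, start, end);
	}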

Are there any cases where the primary MMU transfers the PTE dirty bit
to the folio _other_ than zapping (which already has an MMU-notifier
that reaches KVM)? If not, then there might not be any reason to add a
new notifier. Instead, the contract should just be that secondary MMUs
must also transfer their dirty bits to folios in sync with (or before)
the primary MMU zapping its PTE.
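
Concretely, that contract could look roughly like the sketch below (the
helper name is made up; spte_to_pfn()/is_dirty_spte()/is_accessed_spte()
are the existing x86 SPTE accessors and kvm_set_pfn_dirty()/
kvm_set_pfn_accessed() the existing KVM helpers), which AFAICT is more
or less what mmu_spte_clear_track_bits() does today and what this series
is reworking:

	/*
	 * When dropping an SPTE, e.g. in response to an mmu_notifier
	 * invalidation from the primary MMU, transfer the SPTE's
	 * dirty/accessed state to the backing page before (or while) the
	 * primary MMU zaps its own PTE.
	 */
	static void kvm_drop_spte_transfer_bits(u64 old_spte)
	{
		kvm_pfn_t pfn = spte_to_pfn(old_spte);

		if (is_dirty_spte(old_spte))
			kvm_set_pfn_dirty(pfn);
		if (is_accessed_spte(old_spte))
			kvm_set_pfn_accessed(pfn);
	}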

>
>
> I once had the following idea, but I am not sure about all implications,
> just wanted to raise it because it matches the topic here:
>
> Secondary page tables kind-of behave like "HW" access. If there is a
> write access, we would expect the original PTE to become dirty, not the
> mapped folio.

Propagating SPTE dirty bits to folios indirectly via the primary MMU
PTEs won't work for guest_memfd, where there is no primary MMU PTE. In
order to avoid having two different ways to propagate SPTE dirty bits,
KVM should probably be responsible for updating the folio directly.
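
In other words, something along the lines of the sketch below, marking
the backing page at map time regardless of whether a primary MMU PTE
exists (the helper name is made up; kvm_set_pfn_accessed()/
kvm_set_pfn_dirty() are the real KVM helpers):

	/*
	 * Mark the backing page accessed/dirty when the SPTE is created,
	 * instead of when it is zapped.  This works the same for
	 * guest_memfd, which has no primary MMU PTE to lean on.
	 */
	static void kvm_mark_pfn_on_map(kvm_pfn_t pfn, bool writable)
	{
		kvm_set_pfn_accessed(pfn);
		if (writable)
			kvm_set_pfn_dirty(pfn);
	}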




