Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

David Hildenbrand <david@xxxxxxxxxx> · Wed, 20 Mar 2024 13:56:41 +0100

On 20.03.24 01:50, Sean Christopherson wrote:
> Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
> i.e. when creating mappings for KVM guests, instead of when zapping or
> modifying SPTEs, e.g. when dropping mappings.
>
> The motivation is twofold:
>
>    1. Marking folios dirty and accessed when zapping can be extremely
>       expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
>       512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
>       dirty and accessed for all 512*512 SPTEs.
>
>    2. x86 diverges from literally every other architecture, which updates
>       folios when mappings are created.  AFAIK, x86 is unique in that it's
>       the only KVM arch that prefetches PTEs, so it's not quite an apples-
>       to-apples comparison, but I don't see any reason for the dirty logic
>       in particular to be different.
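
To put a number on the 512*512 in item 1, here is a rough userspace
sketch (not KVM code). It assumes the x86-64 fan-out of 512 entries per
page-table level; the macro and function names are made up for
illustration.

```c
#include <stdint.h>

/* Assumed x86-64 geometry: 512 entries per page-table level. */
#define TOY_ENTRIES_PER_LEVEL 512ULL
#define TOY_SZ_4K             4096ULL

/* Number of 4KiB leaf entries needed to cover one 1GiB mapping. */
static uint64_t toy_leaf_sptes_per_1g(void)
{
	return TOY_ENTRIES_PER_LEVEL * TOY_ENTRIES_PER_LEVEL;
}

/* Bytes covered by that many 4KiB entries; should come out to 1GiB. */
static uint64_t toy_bytes_covered(void)
{
	return toy_leaf_sptes_per_1g() * TOY_SZ_4K;
}
```

That is 262144 leaf SPTEs all pointing into the same huge folio, so
marking the folio once per zapped SPTE repeats the same work 262144
times.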

Already sorry for the lengthy reply.

On "ordinary" process page tables on x86, it behaves as follows:

1) A page might be mapped writable but the PTE might not be dirty. Once
   written to, HW will set the PTE dirty bit.

2) A page might be mapped but the PTE might not be young. Once accessed,
   HW will set the PTE young bit.

3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
   PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
   the folio (folio_mark_accessed()). The latter is done conditionally
   only (vma_has_recency()).

BUT, when zapping an anon folio, we don't do that, because there zapping 
implies "gone for good" and not "content must go to a file".

4) When temporarily unmapping a folio for migration/swapout, we
   primarily only move the dirty PTE bit to the folio.
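
As a toy model of the zap rule in 3), including the anon exception,
something like the following could be used. These are userspace
stand-ins, not the kernel's pte_t or struct folio; the toy_* names are
hypothetical and the vma_recency parameter merely models the
vma_has_recency() condition.

```c
#include <stdbool.h>

/* Toy stand-ins for illustration only. */
struct toy_pte {
	bool present;
	bool dirty;	/* set by HW on write */
	bool young;	/* set by HW on access */
};

struct toy_folio {
	bool anon;
	bool dirty;
	bool accessed;
};

/*
 * On zap, move the PTE dirty/young bits to the folio, except for anon
 * folios, where zapping means the content is gone for good.
 */
static void toy_zap_pte(struct toy_pte *pte, struct toy_folio *folio,
			bool vma_recency)
{
	if (!pte->present)
		return;

	if (!folio->anon) {
		if (pte->dirty)
			folio->dirty = true;
		if (pte->young && vma_recency)
			folio->accessed = true;
	}

	/* The PTE itself is cleared either way. */
	*pte = (struct toy_pte){ 0 };
}
```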

GUP is different, because the PTEs might change after we pinned the page 
and wrote to it. We don't modify the PTEs and expect the GUP user to do 
the right thing (set dirty/accessed). For example, 
unpin_user_pages_dirty_lock() would mark the page dirty when unpinning, 
where the PTE might long be gone.

So GUP does not really behave like HW access.

Secondary page tables are different to ordinary GUP, and KVM ends up
using GUP to some degree to simulate HW access; regarding NUMA-hinting,
KVM has already turned out to be very different to all other GUP
users. [1]

And I recall that at some point I raised that we might want to have a
dedicated interface for these "mmu-notifier" based page table
synchronization mechanisms.

But KVM ends up setting folio dirty/accessed flags itself, like other GUP
users. I do wonder if secondary page tables should be messing with folio
flags *at all*, and if there would be ways to do it differently using PTEs.

We make sure to synchronize the secondary page tables to the process
page tables using MMU notifiers: when we write-protect/unmap a PTE, we
write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
differently.

I once had the following idea, but I am not sure about all implications, 
just wanted to raise it because it matches the topic here:

Secondary page tables kind-of behave like "HW" access. If there is a 
write access, we would expect the original PTE to become dirty, not the 
mapped folio.

1) When KVM wants to map a page into the secondary page table, we
   require the PTE to be young (like a HW access). The SPTE can remain
   old.

2) When KVM wants to map a page writable into the secondary page table,
   we require the PTE to be dirty (like a HW access). The SPTE can
   remain old.

3) When core MM clears the PTE dirty/young bit, we notify the secondary
   page table to adjust: for example, if the dirty bit gets cleared,
   the page cannot be writable in the secondary MMU.

4) GUP-fast cannot set the PTE dirty/young, so we would fall back to slow
   GUP, where we hold the PTL, and simply modify the PTE to have the
   accessed/dirty bit set.

5) Prefetching would similarly be limited to that (only prefetch if PTE
   is already dirty etc.).

6) Dirty/accessed bits no longer have to be synced from the secondary
   page table to the process page table, because an SPTE being dirty
   implies that the PTE is dirty.
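
The six points above can be sketched as a small userspace state model.
This is only an illustration of the idea, not kernel or KVM code: the
toy_* types and functions are hypothetical, locking (the PTL) and MMU
notifiers are not modelled, and only the dirty-bit half of 3) is
covered.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins; not the kernel's pte_t or KVM's SPTE layout. */
struct toy_pte  { bool present, writable, dirty, young; };
struct toy_spte { bool present, writable; };

/* Item 6: a present SPTE implies a young PTE, a writable one a dirty PTE. */
static bool toy_invariant_holds(const struct toy_pte *pte,
				const struct toy_spte *spte)
{
	if (spte->present && !pte->young)
		return false;
	if (spte->present && spte->writable && !pte->dirty)
		return false;
	return true;
}

/* Item 5: prefetch only if that needs no PTE update at all. */
static bool toy_can_prefetch(const struct toy_pte *pte, bool write)
{
	if (!pte->present || !pte->young)
		return false;
	return !write || (pte->writable && pte->dirty);
}

/*
 * Items 1, 2 and 4: fault path. In the real thing this would be slow GUP
 * with the PTL held; here we just set the bits like a HW access would.
 */
static bool toy_map_spte(struct toy_pte *pte, struct toy_spte *spte, bool write)
{
	if (!pte->present || (write && !pte->writable))
		return false;

	pte->young = true;
	if (write)
		pte->dirty = true;

	spte->present = true;
	spte->writable = write;
	assert(toy_invariant_holds(pte, spte));
	return true;
}

/* Item 3: core MM cleans the PTE and the secondary MMU write-protects. */
static void toy_pte_mkclean(struct toy_pte *pte, struct toy_spte *spte)
{
	pte->dirty = false;
	spte->writable = false;
	assert(toy_invariant_holds(pte, spte));
}
```

After toy_pte_mkclean() the SPTE stays mapped read-only, so the next
guest write would have to go through toy_map_spte() again and re-dirty
the PTE first.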

One tricky bit, and the reason ordinary GUP modifies the folio and not
the PTE, is concurrent HW access. For example, when we want to mark a
PTE accessed, it could happen that HW concurrently tries marking the PTE
dirty. We must not lose that update, so we have to guarantee an atomic
update (maybe avoidable in some cases).
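
To illustrate the lost-update problem, here is a userspace sketch using
C11 atomics in place of real page-table accessors. The toy_* helpers
are hypothetical; the bit positions match the x86 accessed (bit 5) and
dirty (bit 6) bits, but nothing below touches a real PTE.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TOY_PTE_ACCESSED (1ULL << 5)
#define TOY_PTE_DIRTY    (1ULL << 6)

/*
 * Lost-update pattern: the new value is computed from a stale snapshot,
 * so any bit set by someone else after the snapshot is overwritten.
 */
static void toy_store_from_snapshot(_Atomic uint64_t *pte, uint64_t snapshot,
				    uint64_t set)
{
	atomic_store(pte, snapshot | set);
}

/* Atomic OR: bits set concurrently by someone else survive. */
static void toy_set_bits_atomic(_Atomic uint64_t *pte, uint64_t set)
{
	atomic_fetch_or(pte, set);
}
```

If "HW" sets TOY_PTE_DIRTY between taking the snapshot and the store,
toy_store_from_snapshot() wipes the dirty bit again, whereas
toy_set_bits_atomic() keeps it.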

What would be the implications? We'd leave setting folio flags to the MM
core. That also implies that if you shut down a VM and zap all anon
folios, you wouldn't have to mark any folio dirty: the PTE is dirty, and
MM core can decide to ignore that flag since it will discard the page
either way.

Downsides? Likely many I have not yet thought about (TLB flushes etc). 
Just mentioning it because in context of [1] I was wondering if 
something that uses MMU notifiers should really be messing with 
dirty/young flags :)

> I tagged this RFC as it is barely tested, and because I'm not 100% positive
> there isn't some weird edge case I'm missing, which is why I Cc'd David H.
> and Matthew.

We'd be in trouble if someone would detect that all PTEs are clean, so 
it can clear the folio dirty flag (for example, after writeback). Then, 
we would write using the SPTE and the folio+PTE would be clean. If we 
then evict the "clean" folio that is actually dirty, we would be in trouble.

Well, we would set the SPTE dirty flag I guess. But I cannot immediately 
tell if that one would be synced back to the folio? Would we have a 
mechanism in place to prevent that?
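
The scenario I have in mind can be written down as a toy state machine.
It is a userspace illustration only: the toy_* names are hypothetical,
and the write_protect_spte flag stands in for whatever mechanism would
(or would not) revoke write access in the secondary MMU when the folio
gets cleaned.

```c
#include <stdbool.h>

struct toy_page_state {
	bool pte_dirty;
	bool folio_dirty;
	bool spte_writable;
	bool spte_dirty;
	bool has_unwritten_data;	/* ground truth: differs from backing store */
};

/* Writeback cleans PTE and folio; optionally write-protects the SPTE. */
static void toy_writeback(struct toy_page_state *s, bool write_protect_spte)
{
	s->pte_dirty = false;
	s->folio_dirty = false;
	s->has_unwritten_data = false;
	if (write_protect_spte)
		s->spte_writable = false;
}

/* Guest write through the SPTE; returns false if it would fault instead. */
static bool toy_guest_write(struct toy_page_state *s)
{
	if (!s->spte_writable)
		return false;
	s->spte_dirty = true;
	s->has_unwritten_data = true;
	return true;
}

/* Eviction that only looks at folio and PTE state, never at the SPTE. */
static bool toy_evict_loses_data(const struct toy_page_state *s)
{
	return s->has_unwritten_data && !s->folio_dirty && !s->pte_dirty;
}
```

With write_protect_spte == false, a guest write after writeback leaves
only the SPTE dirty, and the eviction check reports lost data. With
write_protect_spte == true, the write faults instead and nothing is
lost.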

> Note, I'm going to be offline from ~now until April 1st.  I rushed this out
> as it could impact David S.'s kvm_follow_pfn series[*], which is imminent.
> E.g. if KVM stops marking pages dirty and accessed everywhere, adding
> SPTE_MMU_PAGE_REFCOUNTED just to sanity check that the refcount is elevated
> seems like a poor tradeoff (medium complexity and annoying to maintain, for
> not much benefit).
>
> Regarding David S.'s series, I wouldn't be at all opposed to going even
> further and having x86 follow all architectures by marking pages accessed
> _only_ at map time, at which point I think KVM could simply pass in FOLL_TOUCH
> as appropriate, and thus dedup a fair bit of arch code.

FOLL_TOUCH is weird (excluding weird devmap stuff):

1) For PTEs (follow_page_pte), we set the page dirty and accessed, and
   do not modify the PTE. For THP (follow_trans_huge_pmd), we set the
   PMD young/dirty and don't mess with the folio.

2) FOLL_TOUCH is not implemented for hugetlb.

3) FOLL_TOUCH is not implemented for GUP-fast.

I'd leave that alone :)

[1] https://lore.kernel.org/lkml/20230727212845.135673-1-david@xxxxxxxxxx/T/#u
--
Cheers,

David / dhildenb