Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed

David Hildenbrand <david@xxxxxxxxxx> · Tue, 2 Apr 2024 20:31:35 +0200

On 02.04.24 19:38, David Matlack wrote:
On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 20.03.24 01:50, Sean Christopherson wrote:
Rework KVM to mark folios dirty when creating shadow/secondary PTEs (SPTEs),
i.e. when creating mappings for KVM guests, instead of when zapping or
modifying SPTEs, e.g. when dropping mappings.

The motivation is twofold:

    1. Marking folios dirty and accessed when zapping can be extremely
       expensive and wasteful, e.g. if KVM shattered a 1GiB hugepage into
       512*512 4KiB SPTEs for dirty logging, then KVM marks the huge folio
       dirty and accessed for all 512*512 SPTEs.

    2. x86 diverges from literally every other architecture, which updates
       folios when mappings are created.  AFAIK, x86 is unique in that it's
       the only KVM arch that prefetches PTEs, so it's not quite an apples-
       to-apples comparison, but I don't see any reason for the dirty logic
       in particular to be different.
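
For scale, the 512*512 figure in point 1 can be checked with a few lines of C. This is only an arithmetic sketch: the macro and function names are made up for illustration and are not KVM identifiers. It assumes the usual x86-64 layout of 512 entries per page-table level.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants, not kernel macros. */
#define TOY_ENTRIES_PER_TABLE 512ULL        /* entries per x86-64 page-table level */
#define TOY_SIZE_4K           (4ULL << 10)  /* 4KiB */
#define TOY_SIZE_1G           (1ULL << 30)  /* 1GiB */

/*
 * Number of 4KiB SPTEs that cover one 1GiB hugepage once it has been
 * split all the way down (1GiB -> 512 x 2MiB -> 512*512 x 4KiB).
 */
static uint64_t toy_sptes_per_1g_hugepage(void)
{
	uint64_t n = TOY_ENTRIES_PER_TABLE * TOY_ENTRIES_PER_TABLE;

	/* Sanity check: the 4KiB pieces must add up to exactly 1GiB. */
	assert(n * TOY_SIZE_4K == TOY_SIZE_1G);
	return n;
}
```

So when dirty/accessed state is propagated at zap time, a single huge folio can be marked dirty and accessed 262144 times.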

Already sorry for the lengthy reply.

On "ordinary" process page tables on x86, it behaves as follows:

1) A page might be mapped writable but the PTE might not be dirty. Once
     written to, HW will set the PTE dirty bit.

2) A page might be mapped but the PTE might not be young. Once accessed,
     HW will set the PTE young bit.

3) When zapping a page (zap_present_folio_ptes), we transfer the dirty
     PTE bit to the folio (folio_mark_dirty()), and the young PTE bit to
     the folio (folio_mark_accessed()). The latter is done conditionally
     only (vma_has_recency()).

BUT, when zapping an anon folio, we don't do that, because there zapping
implies "gone for good" and not "content must go to a file".

4) When temporarily unmapping a folio for migration/swapout, we
     primarily move only the dirty PTE bit to the folio.
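
Item 3 and the anon exception can be sketched as a toy model in C. This is not the kernel code: struct toy_pte, struct toy_folio and toy_zap_pte() are hypothetical names, the folio flags are plain booleans, and the vma_has_recency() result is passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a PTE and a folio; not kernel types. */
struct toy_pte {
	bool present;
	bool writable;
	bool dirty;	/* set by "HW" on write */
	bool young;	/* set by "HW" on access */
};

struct toy_folio {
	bool anon;	/* anonymous memory vs. file-backed */
	bool dirty;
	bool accessed;
};

/*
 * Zap one PTE. For file-backed folios, transfer the dirty bit always and
 * the young bit only if the VMA cares about recency. For anon folios,
 * transfer nothing: zapping means the content is gone for good.
 */
static void toy_zap_pte(struct toy_pte *pte, struct toy_folio *folio,
			bool vma_has_recency)
{
	assert(pte != NULL && folio != NULL);

	if (!pte->present)
		return;

	if (!folio->anon) {
		if (pte->dirty)
			folio->dirty = true;
		if (pte->young && vma_has_recency)
			folio->accessed = true;
	}

	/* Clear the whole entry, including the dirty/young bits. */
	*pte = (struct toy_pte){ 0 };
}
```

The point of the model is that the PTE bits are the source of truth until the PTE goes away, and the folio flags are only updated at that moment.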

GUP is different, because the PTEs might change after we pinned the page
and wrote to it. We don't modify the PTEs and expect the GUP user to do
the right thing (set dirty/accessed). For example,
unpin_user_pages_dirty_lock() would mark the page dirty when unpinning,
where the PTE might long be gone.

So GUP does not really behave like HW access.

Secondary page tables are different to ordinary GUP, and KVM ends up
using GUP to some degree to simulate HW access; regarding NUMA-hinting,
KVM has already proven to be very different to all other GUP users. [1]

And I recall that at some point I raised that we might want to have a
dedicated interface for these "mmu-notifier" based page table
synchronization mechanisms.

But KVM ends up setting folio dirty/access flags itself, like other GUP
users. I do wonder if secondary page tables should be messing with folio
flags *at all*, and if there would be ways to do it differently using PTEs.

We make sure to synchronize the secondary page tables to the process
page tables using MMU notifiers: when we write-protect/unmap a PTE, we
write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
differently.

Accessed bits have the test/clear young MMU-notifiers. But I agree
there aren't any notifiers for dirty tracking.

Yes, and I am questioning if the "test" part should exist -- or if 
having a spte in the secondary MMU should require the access bit to be 
set (derived from the primary MMU). (again, my explanation about fake HW 
page table walkers)

There might be a good reason to do it like that nowadays, so I'm only 
raising it as something I was wondering. Likely, frequent clearing of 
the access bit would result in many PTEs in the secondary MMU getting 
invalidated, requiring a new GUP-fast lookup where we would set the 
access bit in the primary MMU PTE. But I'm not an expert on the 
implications with MMU notifiers and access bit clearing.

Are there any cases where the primary MMU transfers the PTE dirty bit
to the folio _other_ than zapping (which already has an MMU-notifier
to KVM)? If not, then there might not be any reason to add a new
notifier. Instead the contract should just be that secondary MMUs must
also transfer their dirty bits to folios in sync with (or before) the
primary MMU zapping its PTE.

Grepping for pte_mkclean(), there might be some cases. Many cases use
MMU notifiers, because they either clear the PTE or also remove write
permissions.

But there are madvise_free_pte_range() and
clean_record_shared_mapping_range()...->clean_record_pte(), which might
only clear the dirty bit without clearing/changing permissions and
consequently without calling MMU notifiers.

Getting a writable PTE without the dirty bit set should be possible.

So I am questioning whether having a writable PTE in the secondary MMU 
with a clean PTE in the primary MMU should be valid to exist. It can 
exist today, and I am not sure if that's the right approach.

I once had the following idea, but I am not sure about all implications,
just wanted to raise it because it matches the topic here:

Secondary page tables kind-of behave like "HW" access. If there is a
write access, we would expect the original PTE to become dirty, not the
mapped folio.

Propagating SPTE dirty bits to folios indirectly via the primary MMU
PTEs won't work for guest_memfd where there is no primary MMU PTE. In
order to avoid having two different ways to propagate SPTE dirty bits,
KVM should probably be responsible for updating the folio directly.

But who really cares about access/dirty bits for guest_memfd?

guest_memfd already wants to disable/bypass all of core-MM, so different 
rules are to be expected. This discussion is all about integration with 
core-MM that relies on correct dirty bits, which does not really apply 
to guest_memfd.

--
Cheers,

David / dhildenb