On Wed, Apr 03, 2024, David Hildenbrand wrote:
> On 03.04.24 02:17, Sean Christopherson wrote:
> > On Tue, Apr 02, 2024, David Hildenbrand wrote:
> > Aha!  But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. will
> > also keep the folio if it has been GUP'd.  And __remove_mapping() explicitly states
> > that it needs to play nice with a GUP'd page being marked dirty before the
> > reference is dropped.
> >
> > 	 * Must be careful with the order of the tests. When someone has
> > 	 * a ref to the folio, it may be possible that they dirty it then
> > 	 * drop the reference. So if the dirty flag is tested before the
> > 	 * refcount here, then the following race may occur:
> >
> > So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
> > code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
> > while holding a reference or (b) marks the folio dirty before returning from its
> > mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops its
> > mappings in response to mmu_notifier_invalidate_range_start().
> >
>
> Yes, I agree that it should work in the context of vmscan. But (b) is
> certainly a bit harder to swallow than "ordinary" (a) :)

Heh, all the more reason to switch KVM x86 from (b) => (a).

> As raised, if having a writable SPTE would imply having a writable+dirty
> PTE, then KVM MMU code wouldn't have to worry about syncing any dirty bits
> ever back to core-mm, so patch #2 would not be required. ... well, it would
> be replaced by an MMU notifier that notifies about clearing the PTE dirty
> bit :)

Hmm, we essentially already have an mmu_notifier today, since secondary MMUs
need to be invalidated before consuming dirty status.  Isn't the end result
essentially a sane FOLL_TOUCH?

> ... because, then, there is also a subtle difference between
> folio_set_dirty() and folio_mark_dirty(), and I am still confused about the
> difference and not competent enough to explain the difference ... and KVM
> always does the former, while zapping code of pagecache folios does the
> latter ... hm

Ugh, just when I thought I finally had my head wrapped around this.

> Related note: IIRC, we usually expect most anon folios to be dirty.
>
> kvm_set_pfn_dirty()->kvm_set_page_dirty() does an unconditional
> SetPageDirty()->folio_set_dirty(). Doing a test-before-set might frequently
> avoid atomic ops.

Noted, definitely worth poking at.
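
For illustration only, here is a rough userspace sketch of the test-before-set
idea, using C11 atomics in place of the kernel's page flag helpers.  Everything
named demo_* (and DEMO_FLAG_DIRTY) is made up for this example; it is not how
folio_set_dirty() or kvm_set_page_dirty() are implemented, just the general
pattern of doing a plain load first and only issuing the atomic RMW when the
bit is still clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit for this sketch; NOT the kernel's PG_dirty. */
#define DEMO_FLAG_DIRTY (1UL << 0)

/* Hypothetical stand-in for a folio: just an atomic flags word. */
struct demo_folio {
	_Atomic unsigned long flags;
};

static inline bool demo_is_dirty(struct demo_folio *f)
{
	return atomic_load_explicit(&f->flags, memory_order_relaxed) &
	       DEMO_FLAG_DIRTY;
}

/* Unconditional set: always issues an atomic read-modify-write. */
static inline void demo_set_dirty(struct demo_folio *f)
{
	atomic_fetch_or_explicit(&f->flags, DEMO_FLAG_DIRTY,
				 memory_order_relaxed);
}

/*
 * Test-before-set: a plain load first, and the atomic RMW only if the
 * bit is still clear.  Returns true if an RMW was actually issued.
 *
 * Two racing callers may both see the bit clear and both issue the RMW;
 * that is harmless here because OR-ing the same bit is idempotent.
 */
static inline bool demo_test_then_set_dirty(struct demo_folio *f)
{
	if (demo_is_dirty(f))
		return false;

	atomic_fetch_or_explicit(&f->flags, DEMO_FLAG_DIRTY,
				 memory_order_relaxed);
	return true;
}

/* Mark the folio dirty n times; return how many RMWs were issued. */
static inline unsigned int demo_mark_dirty_n(struct demo_folio *f,
					     unsigned int n)
{
	unsigned int rmws = 0;

	for (unsigned int i = 0; i < n; i++)
		rmws += demo_test_then_set_dirty(f);

	assert(n == 0 || demo_is_dirty(f));
	return rmws;
}
```

If most folios are already dirty by the time the flag is set again (as
suggested above for anon folios), the common case in this sketch degenerates
to a single load and no atomic op.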