On Thu, Apr 04, 2024, David Hildenbrand wrote:
> On 04.04.24 19:31, Sean Christopherson wrote:
> > On Thu, Apr 04, 2024, David Hildenbrand wrote:
> > > On 04.04.24 00:19, Sean Christopherson wrote:
> > > > Hmm, we essentially already have an mmu_notifier today, since secondary MMUs need
> > > > to be invalidated before consuming dirty status. Isn't the end result essentially
> > > > a sane FOLL_TOUCH?
> > >
> > > Likely. As stated in my first mail, FOLL_TOUCH is a bit of a mess right now.
> > >
> > > Having something that makes sure the writable PTE/PMD is dirty (or
> > > alternatively sets it dirty), paired with MMU notifiers notifying on any
> > > mkclean would be one option that would leave handling how to handle dirtying
> > > of folios completely to the core. It would behave just like a CPU writing to
> > > the page table, which would set the pte dirty.
> > >
> > > Of course, if frequent clearing of the dirty PTE/PMD bit would be a problem
> > > (like we discussed for the accessed bit), that would not be an option. But
> > > from what I recall, only clearing the PTE/PMD dirty bit is rather rare.
> >
> > And AFAICT, all cases already invalidate secondary MMUs anyways, so if anything
> > it would probably be a net positive, e.g. the notification could more precisely
> > say that SPTEs need to be read-only, not blasted away completely.
>
> As discussed, I think at least madvise_free_pte_range() wouldn't do that.

I'm getting a bit turned around. Are you talking about what madvise_free_pte_range()
would do in this future world, or what madvise_free_pte_range() does today? Because
today, unless I'm really misreading the code, secondary MMUs are invalidated before
the dirty bit is cleared.

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
				range.start, range.end);

	lru_add_drain();
	tlb_gather_mmu(&tlb, mm);
	update_hiwater_rss(mm);

	mmu_notifier_invalidate_range_start(&range);
	tlb_start_vma(&tlb, vma);
	walk_page_range(vma->vm_mm, range.start, range.end,
			&madvise_free_walk_ops, &tlb);
	tlb_end_vma(&tlb, vma);
	mmu_notifier_invalidate_range_end(&range);

KVM (or any other secondary MMU) can re-establish mapping with W=1,D=0 in the PTE,
but the costly invalidation (zap+flush+fault) still happens.

> Notifiers would only get called later when actually zapping the folio.

And in case we're talking about a hypothetical future, I was thinking the above
could do MMU_NOTIFY_WRITE_PROTECT instead of MMU_NOTIFY_CLEAR.

> So at least for some time, you would have the PTE not dirty, but the SPTE
> writable or even dirty. So you'd have to set the page dirty when zapping the
> SPTE ... and IMHO that is what we should maybe try to avoid :)