On Fri, Sep 23, 2022, Jason Gunthorpe wrote: > On Fri, Sep 23, 2022 at 05:38:12PM +0200, Jann Horn wrote: > > Hi! > > > > I looked through some of the code related to IOMMUv2 (the thing where > > the IOMMU walks the normal userspace page tables and TLB shootdowns > > are replicated to the IOMMU through > > mmu_notifier_ops::invalidate_range). > > > > I think there's a bug in the interaction between tlb_finish_mmu() and > > mmu_notifier_ops::invalidate_range: In the mm_tlb_flush_nested() case, > > __tlb_reset_range() sets tlb->start and tlb->end *both* to ~0. > > Afterwards, tlb_finish_mmu() calls > > tlb_flush_mmu()->tlb_flush_mmu_tlbonly()->mmu_notifier_invalidate_range(), > > which will pass those tlb->start and tlb->end values to > > mmu_notifier_ops::invalidate_range callbacks. But those callbacks > > don't know about this special case and then basically only flush > > virtual address ~0, making the flush useless. > > Yeah, that looks wrong to me, and it extends more than just the iommu > drivers kvm_arch_mmu_notifier_invalidate_range() also does not handle > this coding. FWIW, the bug is likely benign for KVM. KVM does almost all of its TLB flushing via invalidate_range_{start,end}(), the invalidate_range() hook is used only by x86/VMX to react to a specific KVM-allocated page being migrated (the page is only ever unmapped when the VM is dying). > Most likely tlb_flush_mmu_tlbonly() need to change it back to 0 to ~0? > I wonder why it uses such an odd coding in the first place? > > Actually, maybe having mm_tlb_flush_nested() call __tlb_reset_range() > to generate a 'flush all' request is just a bad idea, as we already > had another bug in 7a30df49f63ad92 related to reset_range doing the > wrong thing for a flush all action. > > > (However, pretty much every place that calls tlb_finish_mmu() first > > calls mmu_notifier_invalidate_range_end() even though the > > appropriate thing would probably be > > mmu_notifier_invalidate_range_only_end(); and I think those two > > things probably cancel each other out?) > > That does sound like double flushing to me, though as you note below, > the invalidate_range() triggered by range_end() after the TLB > flush/page freeing is functionally incorrect, so we cannot rely on it. > > > Also, from what I can tell, the mremap() code, in move_page_tables(), > > only invokes mmu_notifier_ops::invalidate_range via the > > mmu_notifier_invalidate_range_end() at the very end, long after TLB > > flushes must have happened - sort of like the bug we had years ago > > where mremap() was flushing the normal TLBs too late > > (https://bugs.chromium.org/p/project-zero/issues/detail?id=1695). > > Based on the description of eb66ae03082960 I would say that yes the > invalidate_range op is missing here for the same reasons the CPU flush > was missing. > > AFAIK if we are flushing the CPU tlb then we really must also flush > the CPU tlb that KVM controls, and that is primarily what > invalidate_range() is used for. As above, for its actual secondary MMU, KVM invalidates and flushes at invalidate_range_start(), and then prevents vCPUs from creating new entries for the range until invalidate_range_start_end(). The VMX use case is for a physical address that is consumed by hardware without going through the secondary page tables; using the start/end hooks would be slightly annoying due to the need to stall the vCPU until end, and so KVM uses invalidate_range() for that one specific case. > Which makes me wonder if the invalidate_range() hidden inside > invalidate_end() is a bad idea in general - when is this need and > would be correct? Isn't it better to put the invalidates near the TLB > invalidates and leave start/end as purely a bracketing API, which by > definition, cannot have an end that is 'too late'? Documentation/mm/mmu_notifier.rst explains this, although even that is quite subtle. The argument is that if the change is purely to downgrade protections, then deferring invalidate_range() is ok because the only requirement is that secondary MMUs invalidate before the "end" of the sequence. When changing a pte to write protect or to point to a new write protected page with same content (KSM) it is fine to delay the mmu_notifier_invalidate_range call to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing the page table update is preempted right after releasing page table lock but before call mmu_notifier_invalidate_range_end(). That said, I also dislike hiding invalidate_range() inside end(), I constantly forget about that behavior. To address that, what about renaming mmu_notifier_invalidate_range_end() to make it more explicit, e.g. mmu_notifier_invalidate_range_and_end().