On Fri, Sep 23, 2022 at 05:38:12PM +0200, Jann Horn wrote:
> Hi!
>
> I looked through some of the code related to IOMMUv2 (the thing where
> the IOMMU walks the normal userspace page tables and TLB shootdowns
> are replicated to the IOMMU through
> mmu_notifier_ops::invalidate_range).
>
> I think there's a bug in the interaction between tlb_finish_mmu() and
> mmu_notifier_ops::invalidate_range: In the mm_tlb_flush_nested() case,
> __tlb_reset_range() sets tlb->start and tlb->end *both* to ~0.
> Afterwards, tlb_finish_mmu() calls
> tlb_flush_mmu()->tlb_flush_mmu_tlbonly()->mmu_notifier_invalidate_range(),
> which will pass those tlb->start and tlb->end values to
> mmu_notifier_ops::invalidate_range callbacks. But those callbacks
> don't know about this special case and then basically only flush
> virtual address ~0, making the flush useless.

Yeah, that looks wrong to me, and it extends beyond just the IOMMU
drivers: kvm_arch_mmu_notifier_invalidate_range() also does not handle
this encoding.

Most likely tlb_flush_mmu_tlbonly() needs to change it back to a
0..~0 range before calling the notifiers (rough sketch at the end of
this mail)? I wonder why it uses such an odd encoding in the first
place?

Actually, maybe having mm_tlb_flush_nested() call __tlb_reset_range()
to generate a 'flush all' request is just a bad idea, as we already
had another bug in 7a30df49f63ad92 related to __tlb_reset_range()
doing the wrong thing for a 'flush all' action.

> (However, pretty much every place that calls tlb_finish_mmu() first
> calls mmu_notifier_invalidate_range_end() even though the
> appropriate thing would probably be
> mmu_notifier_invalidate_range_only_end(); and I think those two
> things probably cancel each other out?)

That does sound like double flushing to me, though as you note below,
the invalidate_range() triggered by range_end() after the TLB
flush/page freeing is functionally incorrect, so we cannot rely on it.

> Also, from what I can tell, the mremap() code, in move_page_tables(),
> only invokes mmu_notifier_ops::invalidate_range via the
> mmu_notifier_invalidate_range_end() at the very end, long after TLB
> flushes must have happened - sort of like the bug we had years ago
> where mremap() was flushing the normal TLBs too late
> (https://bugs.chromium.org/p/project-zero/issues/detail?id=1695).

Based on the description of eb66ae03082960 I would say that yes, the
invalidate_range op is missing here for the same reason the CPU flush
was missing.

AFAIK if we are flushing the CPU TLB then we really must also flush
the 'CPU TLB' that KVM controls, and that is primarily what
invalidate_range() is used for.

Which makes me wonder if the invalidate_range() hidden inside
invalidate_range_end() is a bad idea in general - when is it needed,
and when would it be correct? Isn't it better to put the invalidates
next to the TLB invalidates and leave start/end as purely a bracketing
API, which, by definition, cannot have an end that is 'too late'?
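
To be concrete about the tlb_flush_mmu_tlbonly() idea, something along
these lines is what I mean - an untested sketch, where the tlb->fullmm
test is just an illustrative way to detect the ~0/~0 encoding that
__tlb_reset_range() produces:

	/* include/asm-generic/tlb.h - untested sketch */
	static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
	{
		/*
		 * Anything calling __tlb_adjust_range() also sets at
		 * least one of these bits.
		 */
		if (!(tlb->freed_tables || tlb->cleared_ptes ||
		      tlb->cleared_pmds || tlb->cleared_puds ||
		      tlb->cleared_p4ds))
			return;

		tlb_flush(tlb);
		/*
		 * In the fullmm / mm_tlb_flush_nested() case,
		 * __tlb_reset_range() left start == end == ~0 as a
		 * 'flush everything' marker.  The invalidate_range()
		 * callbacks don't know that encoding, so hand them an
		 * explicit whole-address-space range instead.
		 */
		if (tlb->fullmm)
			mmu_notifier_invalidate_range(tlb->mm, 0, ~0UL);
		else
			mmu_notifier_invalidate_range(tlb->mm, tlb->start,
						      tlb->end);
		__tlb_reset_range(tlb);
	}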
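
And for the mremap() side, putting the secondary-TLB invalidate next
to the CPU TLB flush would look something like this at the tail of
move_ptes() (again untested and only to illustrate the placement;
move_normal_pmd() and the huge-page paths would need the same
treatment):

	/* mm/mremap.c, end of move_ptes() - untested sketch */
	arch_leave_lazy_mmu_mode();
	if (force_flush) {
		flush_tlb_range(vma, old_end - len, old_end);
		/*
		 * Flush the secondary TLBs (IOMMU, KVM) at the same
		 * point we flush the CPU TLB, instead of relying on the
		 * invalidate_range() buried inside
		 * mmu_notifier_invalidate_range_end(), which runs long
		 * after the PTEs have been moved.
		 */
		mmu_notifier_invalidate_range(vma->vm_mm, old_end - len,
					      old_end);
	}

Jason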