On Thu, Sep 29, 2022 at 07:31:03PM -0700, John Hubbard wrote:
> > Ah, right OK. This is specifically because the iommu is sharing the
> > *exact* page table of the CPU so the trick KVM/etc uses where 'start'
> > makes the shadow PTE non-present and then delays the fault until end
> > completes cannot work here.
>
> ohhh, is this trick something I should read more about, if I'm about to
> jump in here?

All the invalidate_start implementations have a parallel page table
structure that they copy from the main page table into, eg using
hmm_range_fault() or whatever kvm does. Thus we can create a situation
where a sPTE is non-present while a PTE is present.

The iommu/etc point the HW at exactly the same page table as the mm, so
we cannot create a situation where a sPTE is non-present while the PTE
is present.

Thus, IMHO, the only correct answer is to flush the shadow TLB at
exactly the same moments we flush the CPU TLB, because the two TLBs are
processing exactly the same data structure. If there is logic that the
TLB flush can be delayed to optimize it, then that logic applies equally
to both TLBs.

> > So, then we can see where the end_only thing came from, commit
> > 0f10851ea475 ("mm/mmu_notifier: avoid double notification when it is
> > useless") and that long winded message explains why some of the cases
>
> I seem to recall that there was a performance drop involving GPUs, due
> to the double notification. Just to fill in a little bit of history as
> to why Jerome was trying to deduplicate the notifier callbacks.

Sure, the double notification is clearly undesired, but the right way to
fix it was to remove the call to invalidate_range() from
mn_hlist_invalidate_end(), which means adding any missing calls to
invalidate_range() near the CPU TLB flush.

However, as Sean suggested, GPU drivers should not use
invalidate_range() because they are not directly sharing the CPU page
table in the first place. They should use start/end.
Maybe the docs need clarification on this point as well.

> After an initial pass through this, with perhaps 80% understanding
> of the story, I'm reading that as:
>
>     Audit all the sites (which you initially did quickly, above)
>     that 0f10851ea475 touched, and any other related ones, and
>     change things so that invalidate_range() and primary TLB
>     flushing happen at the same point(s).
>
> Yes? Anything else?

I would structure it as:

1) Make a patch fixing the documentation around all of the
   mmu_notifier_invalidate_range_only_end() sites to explain where the
   invalidate_range() call is (eg where the CPU TLB flush is) or why
   the CPU TLB flush is not actually needed. 0f10851ea475 is a good
   guide for where to touch.

   Remove all the confusing documentation about write protect and
   'promotion'. The new logic is that we call invalidate_range() when
   we flush the CPU TLB, and we might do that outside the start/end
   block.

2) Make patch(es) converting all the places calling
   mmu_notifier_invalidate_range_end() to only_end() by identifying
   where the CPU TLB flush is and ensuring the invalidate_range() is
   present at the CPU TLB flush point.

   eg all the range_end() calls on the fork() path are switched to
   only_end() and an invalidate_range() put outside the start/end
   block in dup_mmap() near the TLB flush. This is even more optimal
   because we batch flushing the entire shadow TLB in one shot,
   instead of trying to invalidate every VMA range.

3) Fix the TLB flusher to not send -1 -1 in the corner case Jann
   noticed in the first mail.

Jason