Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Mon, May 22, 2023, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>>
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>>
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>>
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>>
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>>
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>>
>> Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx>
>> ---
>>  include/linux/mmu_notifier.h |  6 +++++
>>  mm/hugetlb.c                 | 24 ++++++++++++++++++-
>>  mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>>  3 files changed, 72 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d6c06e140277..f14d68f119d8 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>>   * pages in the range so to mirror those changes the user must inspect the CPU
>>   * page table (from the end callback).
>>   *
>> + * @MMU_NOTIFY_PROTECTION_UPGRAGE: update is due to a change from read-only to
>> + * read-write for pages in the range. This must not be used to upgrade
>> + * permissions on secondary PTEs, rather it should only be used to invalidate
>> + * caches such as secondary TLBs that may cache old read-only entries.
>
> This is a poor fit for invalidate_range_{start,end}(). All other uses bookend
> the primary MMU update, i.e. call start() _before_ changing PTEs. The comments
> are somewhat stale as they talk only about "unmapped", but the contract between
> the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> MMU will invoke start() and end().
>
>     * invalidate_range_start() is called when all pages in the
>     * range are still mapped and have at least a refcount of one.
>     *
>     * invalidate_range_end() is called when all pages in the
>     * range have been unmapped and the pages have been freed by
>     * the VM.
>
> I'm also confused as to how this actually fixes ARM's SMMU. Unless I'm looking
> at the wrong SMMU implementation, the SMMU implements only invalidate_range(),
> not the start()/end() variants.

mmu_notifier_invalidate_range_end() calls the invalidate_range() callback
if the start()/end() variants aren't set.
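
A minimal userspace sketch of that dispatch, not the kernel code itself.
The model_* names and types are hypothetical stand-ins, and the order of
the two calls is an assumption based on a simplified reading of
mm/mmu_notifier.c:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct model_range {
	unsigned long start;
	unsigned long end;
};

struct model_notifier_ops {
	void (*invalidate_range_start)(const struct model_range *r);
	void (*invalidate_range_end)(const struct model_range *r);
	void (*invalidate_range)(unsigned long start, unsigned long end);
};

/* Counts how often the modelled secondary TLB was flushed. */
static int smmu_flush_count;

static void model_smmu_invalidate_range(unsigned long start, unsigned long end)
{
	assert(start < end);
	smmu_flush_count++;
}

/* Like the SMMU ops quoted below: only invalidate_range() is provided. */
static const struct model_notifier_ops model_smmu_ops = {
	.invalidate_range = model_smmu_invalidate_range,
};

/*
 * Model of the end-of-invalidation path: a driver that supplies only
 * invalidate_range() still gets its flush callback from here.
 */
static void model_invalidate_range_end(const struct model_notifier_ops *ops,
				       const struct model_range *r)
{
	if (ops->invalidate_range)
		ops->invalidate_range(r->start, r->end);
	if (ops->invalidate_range_end)
		ops->invalidate_range_end(r);
}
```

In this model a driver with no start()/end() hooks is still flushed once
per invalidation, which is the behaviour the original patch relied on.
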
>     static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
>         .invalidate_range   = arm_smmu_mm_invalidate_range,
>         .release            = arm_smmu_mm_release,
>         .free_notifier      = arm_smmu_mmu_notifier_free,
>     };
>
> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> is perfectly valid. And AFAICT, the existing invalidate_range() hook is pretty
> much a perfect fit for what you want to achieve.

Right, I didn't take that approach because it doesn't allow an event
type to be passed, which would allow the invalidations to be filtered
on platforms which don't require this.

I had also assumed the invalidate_range() callbacks were allowed to
sleep, hence couldn't be called under PTL. That's certainly true of mmu
interval notifier callbacks, but Catalin reminded me that calls such as
ptep_clear_flush_notify() already call the invalidate_range() callback
under PTL, so I guess we already assume drivers don't sleep in their
invalidate_range() callbacks. I will update the comments to reflect
that.

>     * If invalidate_range() is used to manage a non-CPU TLB with
>     * shared page-tables, it not necessary to implement the
>     * invalidate_range_start()/end() notifiers, as
>     * invalidate_range() already catches the points in time when an
>     * external TLB range needs to be flushed. For more in depth
>     * discussion on this see Documentation/mm/mmu_notifier.rst
>
> Even worse, this change may silently regress performance for secondary MMUs that
> haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> in response to the upgrade, instead of waiting until the guest actually tries to
> utilize the new protections.

Yeah, I like the idea of introducing a
ptep_set_access_flags_notify(). That way this won't regress performance
on platforms that don't need it. Note this isn't a new feature but
rather a bugfix. It's unclear to me why KVM on ARM hasn't already run
into this issue, but I'm no KVM expert. Thanks for the feedback.

 - Alistair
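
For reference, a rough userspace model of what a
ptep_set_access_flags_notify() helper might look like. Nothing here is
from an actual patch: the model_* names, the per-architecture switch and
the PTE encoding are all hypothetical, and it only illustrates "notify
when the PTE really changed, and only on architectures that need it":

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PTE_WRITE 0x1UL

/* Hypothetical switch: true where permission upgrades need a TLB invalidate. */
static bool model_arch_needs_upgrade_flush = true;

/* Counts invalidations delivered to the modelled secondary MMU. */
static int model_secondary_flushes;

/* Stand-in for the invalidate_range() notifier; must not sleep. */
static void model_notifier_invalidate_range(unsigned long start,
					    unsigned long end)
{
	assert(end - start == MODEL_PAGE_SIZE);
	model_secondary_flushes++;
}

/* Stand-in for ptep_set_access_flags(): true only if the PTE changed. */
static bool model_set_access_flags(unsigned long *ptep, unsigned long entry)
{
	bool changed = (*ptep != entry);

	if (changed)
		*ptep = entry;
	return changed;
}

/* Update the PTE, then flush the secondary TLB only when required. */
static bool model_set_access_flags_notify(unsigned long *ptep,
					  unsigned long addr,
					  unsigned long entry)
{
	bool changed = model_set_access_flags(ptep, entry);

	if (changed && model_arch_needs_upgrade_flush)
		model_notifier_invalidate_range(addr, addr + MODEL_PAGE_SIZE);
	return changed;
}
```

With the switch off, the helper reduces to the plain PTE update, so
platforms that don't need the invalidate would see no extra notifier
traffic.
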