On Thu, Oct 6, 2022 at 6:15 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Oct 06, 2022, Kalra, Ashish wrote:
> > For the MMU invalidation notifiers we are going to make two changes
> > currently:
> >
> > 1). Use clflush/clflushopt instead of wbinvd_on_all_cpus() for range <= 2MB.
>
> IMO, this isn't worth pursuing, to the point where I might object to this code
> being added upstream. Avoiding WBINVD for the mmu_notifiers doesn't prevent a
> malicious userspace from using SEV-induced WBINVD to effectively DoS the host,
> e.g. userspace can simply ADD+DELETE memslots, or mprotect() chunks > 2mb.
>

I think using clflush/clflushopt is a tactical workaround for SNP VMs. As
Ashish mentioned earlier:

"For SNP guests we don't need to invoke the MMU invalidation notifiers and the
cache flush should be done at the point of RMP ownership change instead of
mmu_notifier, which will be when the unregister_enc_region ioctl is called, but
as we don't trust the userspace (which can bypass this ioctl), therefore we
continue to use the MMU invalidation notifiers."

So if that is true, i.e. SNP VMs also have to use mmu_notifiers for splitting
the PMDs, then using clflush/clflushopt might be the only workaround I know of.

> Using clflushopt also effectively puts a requirement on mm/ that the notifiers
> be invoked _before_ PTEs are modified in the primary MMU, otherwise KVM may not
> be able to resolve the VA=>PFN, or even worse, resolve the wrong PFN.

I don't understand this. Isn't it always true that MM should fire mmu_notifiers
before modifying PTEs in the host MMU? As far as I know, that is a strict rule,
no?

>
> And no sane VMM should be modifying userspace mappings that cover SEV guest
> memory at any reasonable rate.
>
> In other words, switching to CLFLUSHOPT for SEV+SEV-ES VMs is effectively a
> band-aid for the NUMA balancing issue.

That's not true. KSM might also use the same mechanism. For NUMA balancing and
KSM, there seems to be a common pattern: blindly fire the mmu_notifier first,
then try to do the actual work. I have limited knowledge of MM, but from my
observations, checking whether a page is "PINNED" is very unreliable (or
expensive), i.e. anyone can jump in and pin the page, so it is hard to tell
whether a page is truly "PINNED" or whether someone is just "working" on it
without holding the lock. Holding the refcount of a struct page requires a
spinlock. I suspect that might be why NUMA balancing and KSM just aggressively
fire mmu_notifiers first. I don't know if there is other stuff in MM following
the same pattern.

Concretely, my deep worry is the mmu_notifier in try_to_unmap_one(). I cannot
enumerate all of its callers, but if someone does call into it, it could be a
disaster-level performance hit (flushing at 4K granularity). I hope we can
prove that won't happen.
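
To make item 1) above concrete, something along these lines is what I am
picturing. This is purely a sketch; the helper name and the exact plumbing are
made up for illustration and are not the actual patch:

/*
 * Sketch only: flush a single virtually-contiguous range that is being
 * invalidated.  For ranges up to 2MB, CLFLUSHOPT the affected lines via
 * clflush_cache_range(); anything larger falls back to WBINVD on all CPUs.
 */
static void sev_flush_invalidated_range(void *va, unsigned long size)
{
	if (size <= PMD_SIZE) {
		clflush_cache_range(va, size);
		return;
	}

	wbinvd_on_all_cpus();
}

Of course, this does not address your DoS point: userspace can still force the
> 2MB path with memslot ADD+DELETE or large mprotect() calls. It only avoids
WBINVD for the small, frequent invalidations such as NUMA balancing and KSM
working at 4K/2MB granularity.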