On Wed, Sep 06, 2023, David Stevens wrote: > On Wed, Sep 6, 2023 at 9:45 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > On Tue, Sep 05, 2023, David Stevens wrote: > > > For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier > > > is set, then we're all good here. If guarded_by_mmu_notifier is not > > > set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is > > > set. For struct page memory, we're safe because KVM will hold a > > > reference as long as it's still using the page. For non struct page > > > memory, we're not safe - this is where the breaking change of > > > allow_unsafe_mappings would go. Note that for non-refcounted struct > > > page, we can't use the allow_unsafe_mappings escape hatch. Since > > > FOLL_GET was requested, if we returned such a page, then the caller > > > would eventually corrupt the page refcount via kvm_release_pfn. > > > > Yes we can. The caller simply needs to be made aware of is_refcounted_page. I > > didn't include that in the snippet below because I didn't want to write the entire > > patch. The whole point of adding is_refcounted_page is so that callers can > > identify exactly what type of page was at the end of the trail that was followed. > > Are you asking me to completely migrate every caller of any gfn_to_pfn > variant to __kvm_follow_pfn, so that they can respect > is_refcounted_page? That's the only way to make it safe for > allow_unsafe_mappings to apply to non-refcounted pages. That is > decidedly not simple. Or is kvm_vcpu_map the specific call site you > care about? At best, I can try to migrate x86, and then just add some > sort of compatibility shim for other architectures that rejects > non-refcounted pages. Ah, I see your conundrum. No, I don't think it's reasonable to require you to convert all users in every architecture. I'll still ask, just in case you're feeling generous, but it's not a requirement :-) The easiest way forward I can think of is to add yet another flag to kvm_follow_pfn, e.g. allow_non_refcounted_struct_page, to communicate whether or not the caller has been enlightened to play nice with non-refcounted struct page memory. We'll need that flag no matter what, otherwise we'd have to convert all users in a single patch (LOL). Then your series can simply stop at a reasonable point, e.g. convert all x86 usage (including kvm_vcpu_map(), and leave converting everything else to future work. E.g. I think this would be the outro of hva_to_pfn_remapped(): if (!page) goto out; if (get_page_unless_zero(page)) WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn); out: pte_unmap_unlock(ptep, ptl); /* * TODO: Drop allow_non_refcounted_struct_page once all callers have * been taught to play nice with non-refcounted tail pages. */ if (page && !foll->is_refcounted_page && !foll->allow_non_refcounted_struct_page) r = -EFAULT else if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier && !allow_unsafe_mappings) r = -EFAULT; else *p_pfn = pfn; return r; > > > Property 3 would be nice, but we've already concluded that guarding > > > all translations with mmu notifiers is infeasible. So maintaining > > > property 2 is the best we can hope for. > > > > No, #3 is just a variant of #2. Unless you're talking about not making guarantees > > about guest accesses being ordered with respect to VMA/memslot updates, but I > > don't think that's the case. > > I'm talking about the fact that kvm_vcpu_map is busted with respect to > updates to VMA updates. It won't corrupt host memory because the > mapping keeps a reference to the page, but it will continue to use > stale translations. True. But barring some crazy paravirt use case, userspace modifying a mapping that is in active use is inherently broken, the guest will have no idea that memory just got yanked away. Hmm, though I suppose userspace could theoretically mprotect() a mapping to be read-only, which would "work" for mmu_notifier paths but not kvm_vcpu_map(). But KVM doesn't provide enough information on -EFAULT for userspace to do anything in response to a write to read-only memory, so in practice that's likely inherently broken too. > From [1], it sounds like you've granted that fixing that is not feasible, so > I just wanted to make sure that this isn't the "unsafe" referred to by > allow_unsafe_mappings. Right, this is not the "unsafe" I'm referring to. > [1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@xxxxxxxxxx/