On Wed, 2020-12-02 at 12:32 -0800, Ankur Arora wrote: > > On IRC, Paolo told me that permanent pinning causes problems for memory > > hotplug, and pointed me at the trick we do with an MMU notifier and > > kvm_vcpu_reload_apic_access_page(). > > Okay that answers my question. Thanks for clearing that up. > > Not sure of a good place to document this but it would be good to > have this written down somewhere. Maybe kvm_map_gfn()? Trying not to get too distracted by polishing this part, so I can continue with making more things actually work. But I took a quick look at the reload_apic_access_page() thing. AFAICT it works because the access is only from *within* the vCPU, in guest mode. So all the notifier has to do is kick all CPUs, which happens when it calls kvm_make_all_cpus_request(). Thus we are guaranteed that all CPUs are *out* of guest mode by the time... ...er... maybe not by the time the notifier returns, because all we've done is *send* the IPI and we don't know the other CPUs have actually stopped running the guest yet? Maybe there's some explanation of why the actual TLB shootdown truly *will* occur before the page goes away, and some ordering rules which mean our reschedule IPI will happen first? Something like that ideally would have been in a comment in in MMU notifier. Anyway, I'm not going to fret about that because it clearly doesn't work for the pvclock/shinfo cases anyway; those are accessed from the kernel *outside* guest mode. I think we can use a similar trick but just protect the accesses with kvm->srcu. The code in my tree now uses kvm_map_gfn() at setup time and leaves the shinfo page mapped. A bit like the KVM pvclock code, except that I don't pointlessly map and unmap all the time while leaving the page pinned anyway. So all we need to do is kvm_release_pfn() right after mapping it, leaving the map->hva "unsafe". Then we hook in to the MMU notifier to unmap it when it goes away (in RCU style; clearing the pointer variable, synchronize_srcu(), and *then* unmap, possibly in different range_start/range/range_end callbacks). The *users* of such mappings will need an RCU read lock, and will need to be able to fault the GFN back in when they need it. For now, though, I'm content that converting the Xen shinfo support to kvm_map_gfn() is the right way to go, and the memory hotplug support is an incremental addition on top of it. And I'm even more comforted by the observation that the KVM pvclock support *already* pins its pages, so I'm not even failing to meet the existing bar :)
Attachment:
smime.p7s
Description: S/MIME cryptographic signature