On Fri, Nov 8, 2019 at 5:43 PM Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote: > > On Thu, Nov 07, 2019 at 07:58:46AM -0800, Sean Christopherson wrote: > > On Thu, Nov 07, 2019 at 07:36:45AM -0800, Dan Williams wrote: > > > On Thu, Nov 7, 2019 at 3:12 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote: > > > > > > > > On 07/11/19 06:48, Dan Williams wrote: > > > > >> How do mmu notifiers get held off by page references and does that > > > > >> machinery work with ZONE_DEVICE? Why is this not a concern for the > > > > >> VM_IO and VM_PFNMAP case? > > > > > Put another way, I see no protection against truncate/invalidate > > > > > afforded by a page pin. If you need guarantees that the page remains > > > > > valid in the VMA until KVM can install a mmu notifier that needs to > > > > > happen under the mmap_sem as far as I can see. Otherwise gup just > > > > > weakly asserts "this pinned page was valid in this vma at one point in > > > > > time". > > > > > > > > The MMU notifier is installed before gup, so any invalidation will be > > > > preceded by a call to the MMU notifier. In turn, > > > > invalidate_range_start/end is called with mmap_sem held so there should > > > > be no race. > > > > > > > > However, as Sean mentioned, early put_page of ZONE_DEVICE pages would be > > > > racy, because we need to keep the reference between the gup and the last > > > > time we use the corresponding struct page. > > > > > > If KVM is establishing the mmu_notifier before gup then there is > > > nothing left to do with that ZONE_DEVICE page, so I'm struggling to > > > see what further qualification of kvm_is_reserved_pfn() buys the > > > implementation. > > > > Insertion into KVM's secondary MMU is mutually exclusive with an invalidate > > from the mmu_notifier. KVM holds a reference to the to-be-inserted page > > until the page has been inserted, which ensures that the page is pinned and > > thus won't be invalidated until after the page is inserted. This prevents > > an invalidate from racing with insertion. Dropping the reference > > immediately after gup() would allow the invalidate to run prior to the page > > being inserted, and so KVM would map the stale PFN into the guest's page > > tables after it was invalidated in the host. > > My previous analysis is wrong, although I did sort of come to the right > conclusion. > > The part that's wrong is that KVM does not rely on pinning a page/pfn when > installing the pfn into its secondary MMU (guest page tables). Instead, > KVM keeps track of mmu_notifier invalidate requests and cancels insertion > if an invalidate occured at any point between the start of hva_to_pfn(), > i.e. the get_user_pages() call, and acquiring KVM's mmu lock (which must > also be grabbed by mmu_notifier invalidate). So for any pfn, regardless > of whether it's backed by a struct page, KVM inserts a pfn if and only if > it is guaranteed to get an mmu_notifier invalidate for the pfn (and isn't > already invalidated). > > In the page fault flow, KVM doesn't care whether or not the pfn remains > valid in the associated vma. In other words, Dan's idea of immediately > doing put_page() on ZONE_DEVICE pages would work for *page faults*... > > ...but not for all the other flows where KVM uses gfn_to_pfn(), and thus > get_user_pages(). When accessing entire pages of guest memory, e.g. for > nested virtualization, KVM gets the page associated with a gfn, maps it > with kmap() to get a kernel address and keeps the mapping/page until it's > done reading/writing the page. Immediately putting ZONE_DEVICE pages > would result in use-after-free scenarios for these flows. Thanks for this clarification. I do want to put out though that ZONE_DEVICE pages go idle, they don't get freed. As long as KVM drops its usage on invalidate it's perfectly fine for KVM to operate on idle ZONE_DEVICE pages. The common case is that ZONE_DEVICE pages are accessed and mapped while idle. Only direct-I/O temporarily marks them busy to synchronize with invalidate. KVM obviates that need by coordinating with mmu-notifiers instead.