On Tue, Oct 22, 2019 at 11:59:35AM +0530, Bharata B Rao wrote:
> On Fri, Oct 18, 2019 at 8:31 AM Paul Mackerras <paulus@xxxxxxxxxx> wrote:
> >
> > On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> > > Manage migration of pages between normal and secure memory of secure
> > > guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> > >
> > > H_SVM_PAGE_IN: Move the content of a normal page to secure page
> > > H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> > >
> > > Private ZONE_DEVICE memory equal to the amount of secure memory
> > > available in the platform for running secure guests is created.
> > > Whenever a page belonging to the guest becomes secure, a page from
> > > this private device memory is used to represent and track that
> > > secure page on the HV side. The movement of pages between normal
> > > and secure memory is done via migrate_vma_pages() using UV_PAGE_IN
> > > and UV_PAGE_OUT ucalls.
> >
> > As we discussed privately, but mentioning it here so there is a
> > record: I am concerned about this structure
> >
> > > +struct kvmppc_uvmem_page_pvt {
> > > +	unsigned long *rmap;
> > > +	struct kvm *kvm;
> > > +	unsigned long gpa;
> > > +};
> >
> > which keeps a reference to the rmap. The reference could become stale
> > if the memslot is deleted or moved, and nothing in the patch series
> > ensures that the stale references are cleaned up.
>
> I will add code to release the device PFNs when the memslot goes away.
> In fact the early versions of the patchset had this, but it
> subsequently got removed.
>
> > If it is possible to do without the long-term rmap reference, and
> > instead find the rmap via the memslots (with the srcu lock held) each
> > time we need the rmap, that would be safer, I think, provided that we
> > can sort out the lock ordering issues.
>
> All paths except the fault handler access rmap[] under the srcu lock.
> Even in the case of the fault handler, for those faults induced by us
> (shared page handling, releasing device PFNs), we do hold the srcu
> lock. The difficult case is when we fault due to the HV accessing a
> device page. In this case we come to the fault handler with mmap_sem
> already held and are not in a position to take the kvm srcu lock, as
> that would lead to lock order reversal. Given that we still have the
> pages mapped in, I assume the memslot can't go away while we access
> rmap[], so I think we should be ok here.

The mapping of pages in userspace memory, and the mapping of userspace
memory to guest physical space, are two distinct things. The memslots
describe the mapping of userspace addresses to guest physical
addresses, but don't say anything about what is mapped at those
userspace addresses. So you can indeed get a page fault on a userspace
address at the same time that a memslot is being deleted (even a
memslot that maps that particular userspace address), because removing
the memslot does not unmap anything from userspace memory; it just
breaks the association between that userspace memory and guest
physical memory. Deleting the memslot does unmap the pages from the
guest, but doesn't unmap them from the userspace process (e.g. QEMU).
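To make the "find the rmap via the memslots each time" suggestion
concrete, a minimal (untested) sketch could look like the following.
The helper name is made up, but gfn_to_memslot() and the powerpc
memslot arch.rmap array are the pieces it would be built from:

#include <linux/kvm_host.h>

/*
 * Illustrative only: look the rmap entry up on demand rather than
 * caching a pointer to it in kvmppc_uvmem_page_pvt.  The caller must
 * be inside srcu_read_lock(&kvm->srcu) so that the memslot (and hence
 * the rmap array) can't go away while the returned pointer is in use.
 */
static unsigned long *kvmppc_gfn_to_rmap(struct kvm *kvm, unsigned long gfn)
{
	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

	if (!slot)
		return NULL;

	return &slot->arch.rmap[gfn - slot->base_gfn];
}

As long as every user does the lookup inside an srcu read-side
critical section, there is no long-lived pointer to go stale when a
memslot is deleted or moved.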
It is an interesting question what the semantics should be when a
memslot is deleted and there are pages of userspace currently paged
out to the device (i.e. the ultravisor). One approach might be to say
that all those pages have to come back to the host before we finish
the memslot deletion, but that is probably not necessary; I think we
could just say that those pages are gone and can be replaced by zero
pages if they get accessed on the host side. If userspace then unmaps
the corresponding region of the userspace memory map, we can just
forget all those pages with very little work.

> However if that sounds fragile, maybe I can go back to my initial
> design where we weren't using rmap[] to store device PFNs. That will
> increase the memory usage but will give us an easy option to have a
> per-guest mutex to protect concurrent page-ins/outs/faults.

That sounds like it would be the best option, even if only in the
short term. At least it would give us a working solution, even if
it's not the best-performing solution. (A strawman sketch of that
kind of per-guest tracking is in the PS below.)

Paul.
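PS: Just to be concrete about that alternative design, a strawman for
the per-guest tracking could look something like the following. None
of this is from any version of the patch set and all the names are
made up:

#include <linux/mutex.h>

/*
 * Strawman only: per-guest tracking of which device PFN (if any)
 * backs each guest frame, kept outside rmap[] and guarded by a
 * single per-guest mutex.
 */
struct kvmppc_uvmem_pfns {
	struct mutex lock;	/* serializes page-ins, page-outs and faults */
	unsigned long base_gfn;
	unsigned long nr_gfns;
	unsigned long *pfns;	/* device PFN for each GFN, 0 = not secure */
};

/* Record that @gfn is now backed by device page @pfn (0 to clear). */
static void kvmppc_uvmem_set_pfn(struct kvmppc_uvmem_pfns *p,
				 unsigned long gfn, unsigned long pfn)
{
	mutex_lock(&p->lock);
	p->pfns[gfn - p->base_gfn] = pfn;
	mutex_unlock(&p->lock);
}

Since the table is independent of the memslots and of rmap[], the HV
fault handler could take just this mutex (with mmap_sem already held)
rather than the kvm srcu lock, which would sidestep the lock ordering
problem described above.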