On 11/8/2016 4:46 AM, Alex Williamson wrote: > On Sat, 5 Nov 2016 02:40:44 +0530 > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote: > ... >> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma) >> +static int __vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr, >> + int prot, unsigned long *pfn_base, >> + bool do_accounting) >> +{ >> + struct task_struct *task = dma->task; >> + unsigned long limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; >> + bool lock_cap = dma->mlock_cap; >> + struct mm_struct *mm = dma->addr_space->mm; >> + int ret; >> + bool rsvd; >> + >> + ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base); >> + if (ret) >> + return ret; >> + >> + rsvd = is_invalid_reserved_pfn(*pfn_base); >> + >> + if (!rsvd && !lock_cap && mm->locked_vm + 1 > limit) { >> + put_pfn(*pfn_base, prot); >> + pr_warn("%s: Task %s (%d) RLIMIT_MEMLOCK (%ld) exceeded\n", >> + __func__, task->comm, task_pid_nr(task), >> + limit << PAGE_SHIFT); >> + return -ENOMEM; >> + } >> + >> + if (!rsvd && do_accounting) >> + vfio_lock_acct(mm, 1); >> + >> + return 1; >> +} >> + >> +static void __vfio_unpin_page_external(struct vfio_addr_space *addr_space, >> + unsigned long pfn, int prot, >> + bool do_accounting) >> +{ >> + put_pfn(pfn, prot); >> + >> + if (do_accounting) >> + vfio_lock_acct(addr_space->mm, -1); > > Can't we batch this like we do elsewhere? Intel folks, AIUI you intend > to pin all VM memory through this side channel, have you tested the > scalability and performance of this with larger VMs? Our vfio_pfn > data structure alone is 40 bytes per pinned page, which means for > each 1GB of VM memory, we have 10MBs worth of struct vfio_pfn! > Additionally, unmapping each 1GB of VM memory will result in 256k > separate vfio_lock_acct() callbacks. I'm concerned that we're not > being efficient enough in either space or time. > > One thought might be whether we really need to save the pfn, we better > always get the same result if we pin it again, or maybe we can just do > a lookup through the mm at that point without re-pinning. Could we get > to the point where we only need an atomic_t ref count per page in a > linear array relative to the IOVA? Ok. Is System RAM hot-plug supported? How is system RAM hot-plug handled? Are there DMA_MAP calls on such hot-plug for additional range? If we have a linear array/memory, we will have to realloc it on memory hot-plug? > That would give us 1MB per 1GB > overhead. The semantics of the pin and unpin would make more sense then > too, both would take an IOVA range, only pinning would need a return > mechanism. For instance: > > int pin_pages(void *iommu_data, dma_addr_t iova_base, > int npage, unsigned long *pfn_base); > > This would pin physically contiguous pages up to npage, returning the > base pfn and returning the number of pages pinned (<= npage). The > vendor driver would make multiple calls to fill the necessary range. With the current patch, input is user_pfn[] array and npages. int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, int npage, int prot, unsigned long *phys_pfn) When guest allocates memory with malloc(), gfns would not be contiguous, right? These gfns (user_pfns) are passed as argument here. Is there any case where we could get pin/unpin request for contiguous pages? > Unpin would then simply be: > > void unpin_pages(void *iommu_data, dma_addr_t iova_base, int npage); > > Hugepage usage would really make such an interface shine (ie. 2MB+ > contiguous ranges). A downside would be the overhead of getting the > group and container reference in vfio for each callback, perhaps we'd > need to figure out how the vendor driver could hold that reference. In very initial phases of proposal, I had suggested to keep pointer to container->iommu_data in struct mdev_device. But that was discarded. > The current API of passing around pfn arrays, further increasing the > overhead of the whole ecosystem just makes me cringe though. > ... >> + if (ret <= 0) { >> + WARN_ON(!ret); >> + goto pin_unwind; >> + } >> + >> + mutex_lock(&dma->addr_space->pfn_list_lock); >> + >> + /* search if pfn exist */ >> + p = vfio_find_pfn(dma->addr_space, pfn[i]); >> + if (p) { >> + atomic_inc(&p->ref_count); > > We never test whether (p->prot == prot), shouldn't we be doing > something in that case? In fact, why do we allow the side-channel > through the .{un}pin_pages to specify page protection flags that might > be different than the user specified for the DMA_MAP? If the user > specified read-only, the vendor driver should not be allowed to > override with read-write access. > If user specified protection flags for DMA_MAP could be prot = IOMMU_WRITE | IOMMU_READ; But vendor driver can request to pin page to be readonly, i.e. IOMMU_READ. In that case, pin pages should be allowed, right? Then the check should be if (p->prot & prot). Thanks, Kirti -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html