On Wed, 12 Apr 2017 12:13:43 +0800 Peter Xu <peterx@xxxxxxxxxx> wrote: > On Tue, Apr 11, 2017 at 12:50:11PM -0600, Alex Williamson wrote: > > On Tue, 11 Apr 2017 12:27:55 -0600 > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > > > > > On Tue, 11 Apr 2017 19:03:14 +0800 > > > Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > > On Thu, Apr 06, 2017 at 08:53:43AM -0600, Alex Williamson wrote: > > [...] > > > > > > -static void vfio_lock_acct(struct task_struct *task, long npage) > > > > > -{ > > > > > - struct vwork *vwork; > > > > > struct mm_struct *mm; > > > > > bool is_current; > > > > > + int ret; > > > > > > > > > > if (!npage) > > > > > - return; > > > > > + return 0; > > > > > > > > > > is_current = (task->mm == current->mm); > > > > > > > > > > mm = is_current ? task->mm : get_task_mm(task); > > > > > > > > A question besides current patch: could I ask why we need to take > > > > special care for is_current? I see that is only used to only try avoid > > > > get_task_mm() when proper, but is get_task_mm() a heavy operation? > > > > > > Yes, it's slower, performance was significantly degraded when mdev > > > support was introduced and imposed get_task_mm() on all calling paths. > > I see. Thanks. > > > > > > > > > if (!mm) > > > > > - return; /* process exited */ > > > > > + return -ESRCH; /* process exited */ > > > > > > > > > > - if (down_write_trylock(&mm->mmap_sem)) { > > > > > - mm->locked_vm += npage; > > > > > - up_write(&mm->mmap_sem); > > > > > - if (!is_current) > > > > > - mmput(mm); > > > > > - return; > > > > > - } > > > > > + ret = down_write_killable(&mm->mmap_sem); > > > > > + if (!ret) { > > > > > + if (npage < 0) { > > > > > + mm->locked_vm += npage; > > > > > + } else { > > > > > + unsigned long limit; > > > > > + > > > > > + limit = is_current ? > > > > > + rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT : > > > > > + task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > > > > > > Maybe we can directly use task_rlimit() here? Since looks like > > > > rlimit() is calling it as well, with "current". > > > > > > We could, but does it actually change anything? rlimit() is static > > > inline, so using task_rlimit() for both just moves the is_current > > > ternary into the task_rlimit() argument. Is this: > > > > > > limit = task_rlimit(is_current ? current : task, > > > RLIMIT_MEMLOCK) >> PAGE_SHIFT); > > > > > > notably cleaner than above? > > > > Ah, maybe you were suggesting that we can just use "task" here for > > both since it's always correct. Thanks, > > Yes it is. > > [...] > > > > > > out: > > > > > - vfio_lock_acct(current, lock_acct); > > > > > + ret = vfio_lock_acct(current, lock_acct); > > > > > + > > > > > +unpin_out: > > > > > + if (ret) { > > > > > + if (!rsvd) { > > > > > + for (pfn = *pfn_base ; pinned ; pfn++, pinned--) > > > > > + put_pfn(pfn, dma->prot); > > > > > + } > > > > > + > > > > > + return ret; > > > > > + } > > > > > > > > The change in vfio_pin_pages_remote() seems to contain a functional > > > > change totally not related to the subject (IIUC, we are going to unpin > > > > those pages if the huge page can only be pinned partially, and we are > > > > not doing that before)? If so, would it be nice to split current patch > > > > into two, or at least mention this behavior change in commit log of > > > > this patch? > > > > > > > > > This is only tangentially about hugepages, this loop is looking for > > > contiguous pages regardless of the processor or IOMMU page size > > > support. > > It should somewhat related to huge pages? At least we have > disable_hugepages parameter, and as well in vfio_pin_pages_remote() we > have: > > if (unlikely(disable_hugepages)) > goto out; > > So the loop will be skipped if that is specified. Right, that's why I say tangentially, the IOMMU driver cannot make use of any sort of superpages if we map at PAGE_SIZE granularity, at least not until they implement something like transparent hugepage support for the IOMMU. But the point is that the type1 vfio IOMMU backend isn't directly aware of hugepages. > > > We're trying to make as few calls to iommu_map() as we can > > > and thus we want the maximum range of IOVA to physical address we can > > > pump into the IOMMU driver. It's up to the IOMMU driver to figure out > > > how it can optimize that range with hugepages or superpages in its page > > > tables. So the caller of this function is looping over a range of > > > memory and each time this function returns, it maps that many pages > > > through the iommu. On success we return <=npage. > > > > > > The unpin_out loop here is absolutely related to the change proposed > > > here, vfio_lock_acct() can fail, we cannot return both an error and pin > > > pages, therefore we need to undo anything we've pinned this round. > > Yes you are right. It's related. > > > > > > > Are you perhaps only referring to the exit path above going straight to > > > this loop rather than attempting to do the accounting for the pages > > > pinned so far? Previously that was our only option because the unwind > > > path was to return a short count, invoking the page accounting and > > > iommu_mapping, while fully expecting the caller to again loop over the > > > excess page, return -ENOMEM, and teardown the entire mapping request. > > > So because we now require an unwind path for the vfio_lock_acct() > > > change, we can now do the teardown w/o the additional pinning here and > > > mapping by the caller. In a very strict sense, we could consider that > > > a separate change and move those 3 lines to a follow-on patch but > > > ultimately the caller did request pinned pages beyond what we believe > > > their limit to be and making use of this new exit path here saves us a > > > useless accounting and mapping iteration. I can note that in the > > > commit log. Thanks, > > I agree that in all cases it's a corner case and trivial, and the > userspace caller should anyway do something to release its memory > accounting. There's actually no userspace visible change. Before or after, if the user attempts to map more pages than their locked memory limit, they get -ENOMEM. The difference is only whether we handle the condition lazily by completing pinning and mapping up to the limit, before failing and tearing down at the next pass, or undo the work of the current pass and return error to teardown any previous work. Either path all happens within the mapping ioctl. > Thanks for addressing my comments and replied with full detail! Thanks for reviewing! Alex