> From: Jason Gunthorpe <jgg@xxxxxxxxxx> > Sent: Wednesday, March 23, 2022 12:15 AM > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote: > > > I'm still picking my way through the series, but the later compat > > interface doesn't mention this difference as an outstanding issue. > > Doesn't this difference need to be accounted in how libvirt manages VM > > resource limits? > > AFACIT, no, but it should be checked. > > > AIUI libvirt uses some form of prlimit(2) to set process locked > > memory limits. > > Yes, and ulimit does work fully. prlimit adjusts the value: > > int do_prlimit(struct task_struct *tsk, unsigned int resource, > struct rlimit *new_rlim, struct rlimit *old_rlim) > { > rlim = tsk->signal->rlim + resource; > [..] > if (new_rlim) > *rlim = *new_rlim; > > Which vfio reads back here: > > drivers/vfio/vfio_iommu_type1.c: unsigned long pfn, limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > drivers/vfio/vfio_iommu_type1.c: unsigned long limit = > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > And iommufd does the same read back: > > lock_limit = > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> > PAGE_SHIFT; > npages = pages->npinned - pages->last_npinned; > do { > cur_pages = atomic_long_read(&pages->source_user- > >locked_vm); > new_pages = cur_pages + npages; > if (new_pages > lock_limit) > return -ENOMEM; > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, > cur_pages, > new_pages) != cur_pages); > > So it does work essentially the same. > > The difference is more subtle, iouring/etc puts the charge in the user > so it is additive with things like iouring and additively spans all > the users processes. > > However vfio is accounting only per-process and only for itself - no > other subsystem uses locked as the charge variable for DMA pins. > > The user visible difference will be that a limit X that worked with > VFIO may start to fail after a kernel upgrade as the charge accounting > is now cross user and additive with things like iommufd. > > This whole area is a bit peculiar (eg mlock itself works differently), > IMHO, but with most of the places doing pins voting to use > user->locked_vm as the charge it seems the right path in today's > kernel. > > Ceratinly having qemu concurrently using three different subsystems > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for > RLIMIT_MEMLOCK differently cannot be sane or correct. > > I plan to fix RDMA like this as well so at least we can have > consistency within qemu. > I have an impression that iommufd and vfio type1 must use the same accounting scheme given the management stack has no insight into qemu on which one is actually used thus cannot adapt to the subtle difference in between. in this regard either we start fixing vfio type1 to use user->locked_vm now or have iommufd follow vfio type1 for upward compatibility and then change them together at a later point. I prefer to the former as IMHO I don't know when will be a later point w/o certain kernel changes to actually break the userspace policy built on a wrong accounting scheme... Thanks Kevin