Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 24 Mar 2022 14:40:15 -0600

On Tue, 22 Mar 2022 13:15:21 -0300
Jason Gunthorpe via iommu <iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?    
> 
> AFAICT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.  
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
> 		struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
> 	rlim = tsk->signal->rlim + resource;
> [..]
> 		if (new_rlim)
> 			*rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:        unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
> 	lock_limit =
> 		task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 	npages = pages->npinned - pages->last_npinned;
> 	do {
> 		cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> 		new_pages = cur_pages + npages;
> 		if (new_pages > lock_limit)
> 			return -ENOMEM;
> 	} while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> 				     new_pages) != cur_pages);
> 
> So it does work essentially the same.

Well, except for the part about vfio updating mm->locked_vm and iommufd
updating user->locked_vm, a per-process counter versus a per-user
counter.  prlimit specifically sets process resource limits, which get
reflected in task_rlimit.

For example, let's say a user has two 4GB VMs and they're hot-adding
vfio devices to each of them, so libvirt needs to dynamically modify
the locked memory limit for each VM.  AIUI, libvirt would look at the
VM size and call prlimit to set that value.  If libvirt does this to
both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
pages to mm->locked_vm, so this works well.  In the iommufd loop above,
we're comparing a per-task/process limit to a per-user counter.  So I'm
a bit lost how both VMs can pin their pages here.
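
To make the arithmetic concrete, here is a minimal userspace sketch of the
two accounting schemes.  It is not kernel code: the sketch_* names, the
4 KiB page size, and the 0/-1 return convention are all made up for
illustration, and it ignores locking and any capability-based bypass.  It
only models "compare a per-process limit against a per-process counter"
versus "compare a per-process limit against a per-user counter".

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12            /* assumption: 4 KiB pages */
#define SKETCH_GIB (1ULL << 30)

/* Hypothetical stand-ins for the two counters; not the kernel structures. */
struct sketch_mm   { uint64_t locked_vm; };  /* per-process, like mm->locked_vm */
struct sketch_user { uint64_t locked_vm; };  /* per-user, like user->locked_vm */

struct sketch_vm {
	uint64_t memlock_limit;          /* bytes, set per process via prlimit */
	struct sketch_mm mm;             /* private to this VM process */
	struct sketch_user *user;        /* shared by every VM of one user */
};

/* vfio-type1-like: charge the process's own counter.  Returns 0 or -1. */
static int pin_per_process(struct sketch_vm *vm, uint64_t npages)
{
	uint64_t limit = vm->memlock_limit >> SKETCH_PAGE_SHIFT;

	if (vm->mm.locked_vm + npages > limit)
		return -1;
	vm->mm.locked_vm += npages;
	return 0;
}

/* iommufd-like: same per-process limit, but charge the shared user counter. */
static int pin_per_user(struct sketch_vm *vm, uint64_t npages)
{
	uint64_t limit = vm->memlock_limit >> SKETCH_PAGE_SHIFT;

	if (vm->user->locked_vm + npages > limit)
		return -1;
	vm->user->locked_vm += npages;
	return 0;
}

/*
 * Two 4GB VMs owned by the same user, each with a 4GB memlock limit,
 * each pinning all of its memory.  Returns how many pins succeed.
 */
int sketch_two_vms(int per_user)
{
	struct sketch_user user = { 0 };
	struct sketch_vm vms[2] = {
		{ 4 * SKETCH_GIB, { 0 }, &user },
		{ 4 * SKETCH_GIB, { 0 }, &user },
	};
	uint64_t npages = (4 * SKETCH_GIB) >> SKETCH_PAGE_SHIFT;
	int ok = 0;

	for (int i = 0; i < 2; i++) {
		int ret = per_user ? pin_per_user(&vms[i], npages)
				   : pin_per_process(&vms[i], npages);
		if (ret == 0)
			ok++;
	}
	return ok;
}
```

In this model sketch_two_vms(0) lets both VMs pin, while sketch_two_vms(1)
lets only the first one pin, because the second VM's pages land on a
counter that the first VM has already filled up to the 4GB limit.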

Am I missing some assumption about how libvirt uses prlimit or
sandboxes users?

> The difference is more subtle: iouring/etc puts the charge in the user,
> so it is additive with things like iouring and additively spans all
> of the user's processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.

And that's exactly the concern.

> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.

The philosophy of whether it's ultimately a better choice for the
kernel aside, if userspace breaks because we're accounting in a
per-user pool rather than a per-process pool, then our compatibility
layer ain't so transparent.

> Certainly having qemu concurrently using three different subsystems
> (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.

I think everyone would agree with that, but it also seems there are
real differences between task_rlimits and per-user vs per-process
accounting buckets and I'm confused how that's not a blocker for trying
to implement transparent compatibility.  Thanks,

Alex