> From: Jason Wang <jasowang@xxxxxxxxxx>
> Sent: Thursday, March 24, 2022 10:57 AM
>
> On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> >
> > > From: Jason Wang <jasowang@xxxxxxxxxx>
> > > Sent: Thursday, March 24, 2022 10:28 AM
> > >
> > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > > >
> > > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > >
> > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > >
> > > > > > I'm still picking my way through the series, but the later compat
> > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > Doesn't this difference need to be accounted for in how libvirt
> > > > > > manages VM resource limits?
> > > > >
> > > > > AFAICT, no, but it should be checked.
> > > > >
> > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > memory limits.
> > > > >
> > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > >
> > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > >                struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > {
> > > > >         rlim = tsk->signal->rlim + resource;
> > > > >         [..]
> > > > >         if (new_rlim)
> > > > >                 *rlim = *new_rlim;
> > > > >
> > > > > Which vfio reads back here:
> > > > >
> > > > > drivers/vfio/vfio_iommu_type1.c: unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > drivers/vfio/vfio_iommu_type1.c: unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > >
> > > > > And iommufd does the same read back:
> > > > >
> > > > >         lock_limit =
> > > > >                 task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > >         npages = pages->npinned - pages->last_npinned;
> > > > >         do {
> > > > >                 cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > > >                 new_pages = cur_pages + npages;
> > > > >                 if (new_pages > lock_limit)
> > > > >                         return -ENOMEM;
> > > > >         } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > >                                      cur_pages, new_pages) != cur_pages);
> > > > >
> > > > > So it does work essentially the same.
> > > > >
> > > > > The difference is more subtle: iouring/etc puts the charge in the
> > > > > user, so it is additive with things like iouring and additively
> > > > > spans all the user's processes.
> > > > >
> > > > > However vfio is accounting only per-process and only for itself - no
> > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > >
> > > > > The user-visible difference will be that a limit X that worked with
> > > > > VFIO may start to fail after a kernel upgrade as the charge
> > > > > accounting is now cross-user and additive with things like iommufd.
> > > > >
> > > > > This whole area is a bit peculiar (eg mlock itself works
> > > > > differently), IMHO, but with most of the places doing pins voting
> > > > > to use user->locked_vm as the charge it seems the right path in
> > > > > today's kernel.
> > > > >
> > > > > Certainly having qemu concurrently using three different subsystems
> > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > >
> > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > consistency within qemu.
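As a side note, a minimal sketch of what the libvirt-style prlimit(2)
usage mentioned above may look like from userspace; set_memlock_limit(),
the pid and the byte count are hypothetical, only prlimit(2) and
RLIMIT_MEMLOCK are from the discussion:

#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical management-stack helper, not actual libvirt code. */
static int set_memlock_limit(pid_t qemu_pid, rlim_t bytes)
{
        struct rlimit rlim = { .rlim_cur = bytes, .rlim_max = bytes };

        /* Ends up in do_prlimit() quoted above, overwriting
         * tsk->signal->rlim[RLIMIT_MEMLOCK]; vfio and iommufd later
         * read the new value back via rlimit()/task_rlimit(). */
        return prlimit(qemu_pid, RLIMIT_MEMLOCK, &rlim, NULL);
}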
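And to make the accounting difference concrete, a minimal sketch of the
two charging schemes; charge_mm_locked() and charge_user_locked() are
invented names, only the locked_vm fields and the limit readback mirror
the kernel code quoted above:

#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/sched/user.h>

/* vfio type1 style: the charge lives in mm->locked_vm, so it is
 * per-process and is seen by no other pinning subsystem. */
static int charge_mm_locked(struct mm_struct *mm, long npages)
{
        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

        if (mm->locked_vm + npages > limit)
                return -ENOMEM;
        mm->locked_vm += npages;  /* under mmap_lock in the real code */
        return 0;
}

/* iommufd/iouring style: the charge lives in user->locked_vm, shared
 * by all processes of the user and additive across every subsystem
 * charging the same way - hence a limit X that was enough for vfio
 * alone may now be exceeded. */
static int charge_user_locked(struct user_struct *user, long npages)
{
        unsigned long limit = task_rlimit(current, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        long cur_pages, new_pages;

        do {
                cur_pages = atomic_long_read(&user->locked_vm);
                new_pages = cur_pages + npages;
                if (new_pages > limit)
                        return -ENOMEM;
        } while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
                                     new_pages) != cur_pages);
        return 0;
}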
> > > >
> > > > I have an impression that iommufd and vfio type1 must use
> > > > the same accounting scheme, given that the management stack
> > > > has no insight into which one qemu actually uses and thus
> > > > cannot adapt to the subtle difference between them. In this
> > > > regard, either we start fixing vfio type1 to use user->locked_vm
> > > > now, or have iommufd follow vfio type1 for upward compatibility
> > > > and then change them together at a later point.
> > > >
> > > > I prefer the former, as IMHO I don't know when that later point
> > > > would come w/o certain kernel changes to actually break the
> > > > userspace policy built on a wrong accounting scheme...
> > >
> > > I wonder if the kernel is the right place to do this. We have new uAPI
> >
> > I didn't get this. This thread is about VFIO using a wrong accounting
> > scheme, and the discussion is about the impact of fixing it on
> > userspace.
>
> It's probably too late to fix the kernel considering it may break the
> userspace.
>
> > I didn't see the question on the "right place" part.
>
> I meant it would be easier to let userspace know the difference than
> to try to fix or work around it in the kernel.

Jason already plans to fix RDMA, which will also lead to user-visible
impact when Qemu uses both VFIO and RDMA. Why is VFIO so special
that it is left behind to carry the legacy misdesign?

>
> > > so management layer can know the difference of the accounting in
> > > advance by
> > >
> > > -device vfio-pci,iommufd=on
> > >
> >
> > I suppose iommufd will be used once Qemu supports it, as long as
> > the open compatibility issues that Jason/Alex discussed in another
> > thread are well addressed. It does not necessarily have to be a
> > control knob exposed to the caller.
>
> It has a lot of implications if we do this, it means iommufd needs to
> inherit all the userspace-noticeable behaviour as well as the "bugs"
> of VFIO.
>
> We know it's easier to find the difference than to say there is no
> difference.

In the end vfio type1 will be replaced by the iommufd compat layer.
With that goal in mind, iommufd has to inherit type1 behaviors.

Thanks
Kevin