On Thu, Mar 24, 2022 at 11:15 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
>
> > From: Jason Wang <jasowang@xxxxxxxxxx>
> > Sent: Thursday, March 24, 2022 10:57 AM
> >
> > On Thu, Mar 24, 2022 at 10:42 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > >
> > > > From: Jason Wang <jasowang@xxxxxxxxxx>
> > > > Sent: Thursday, March 24, 2022 10:28 AM
> > > >
> > > > On Thu, Mar 24, 2022 at 10:12 AM Tian, Kevin <kevin.tian@xxxxxxxxx> wrote:
> > > > >
> > > > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > > > > Sent: Wednesday, March 23, 2022 12:15 AM
> > > > > >
> > > > > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > > > > >
> > > > > > > I'm still picking my way through the series, but the later compat
> > > > > > > interface doesn't mention this difference as an outstanding issue.
> > > > > > > Doesn't this difference need to be accounted for in how libvirt
> > > > > > > manages VM resource limits?
> > > > > >
> > > > > > AFAICT, no, but it should be checked.
> > > > > >
> > > > > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > > > > memory limits.
> > > > > >
> > > > > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > > > >
> > > > > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > > > > >                struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > > > > {
> > > > > >         rlim = tsk->signal->rlim + resource;
> > > > > >         [..]
> > > > > >         if (new_rlim)
> > > > > >                 *rlim = *new_rlim;
> > > > > >
> > > > > > Which vfio reads back here:
> > > > > >
> > > > > > drivers/vfio/vfio_iommu_type1.c: unsigned long pfn, limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > > drivers/vfio/vfio_iommu_type1.c: unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >
> > > > > > And iommufd does the same read back:
> > > > > >
> > > > > >         lock_limit = task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > >         npages = pages->npinned - pages->last_npinned;
> > > > > >         do {
> > > > > >                 cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > > > > >                 new_pages = cur_pages + npages;
> > > > > >                 if (new_pages > lock_limit)
> > > > > >                         return -ENOMEM;
> > > > > >         } while (atomic_long_cmpxchg(&pages->source_user->locked_vm,
> > > > > >                                      cur_pages, new_pages) != cur_pages);
> > > > > >
> > > > > > So it does work essentially the same.
> > > > > >
> > > > > > The difference is more subtle: iouring/etc put the charge on the
> > > > > > user, so it is additive with things like iouring and additively
> > > > > > spans all the user's processes.
> > > > > >
> > > > > > However vfio is accounting only per-process and only for itself - no
> > > > > > other subsystem uses locked as the charge variable for DMA pins.
> > > > > >
> > > > > > The user-visible difference will be that a limit X that worked with
> > > > > > VFIO may start to fail after a kernel upgrade, as the charge
> > > > > > accounting is now cross-user and additive with things like iommufd.
> > > > > >
> > > > > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > > > > IMHO, but with most of the places doing pins voting to use
> > > > > > user->locked_vm as the charge, it seems the right path in today's
> > > > > > kernel.
> > > > > >
> > > > > > Certainly having qemu concurrently using three different subsystems
> > > > > > (vfio, rdma, iouring) issuing FOLL_LONGTERM and all accounting for
> > > > > > RLIMIT_MEMLOCK differently cannot be sane or correct.
> > > > > >
> > > > > > I plan to fix RDMA like this as well so at least we can have
> > > > > > consistency within qemu.
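(As an aside, to make the libvirt side of this concrete: below is a minimal
userspace sketch, not taken from any of the quoted mails, of the prlimit(2)
flow that sets RLIMIT_MEMLOCK on another process - the same value that the
rlimit()/task_rlimit() reads above later observe. The target pid and the
256 MiB figure are arbitrary illustrations.)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
        /* Target process, e.g. a qemu pid; 0 means the caller itself. */
        pid_t pid = argc > 1 ? atoi(argv[1]) : 0;

        /* An arbitrary 256 MiB example limit. */
        struct rlimit new_rlim = {
                .rlim_cur = 256UL << 20,
                .rlim_max = 256UL << 20,
        };
        struct rlimit old_rlim;

        /* As libvirt does: write tsk->signal->rlim[RLIMIT_MEMLOCK] of
         * another process via prlimit(2), getting the old value back. */
        if (prlimit(pid, RLIMIT_MEMLOCK, &new_rlim, &old_rlim)) {
                perror("prlimit");
                return 1;
        }

        printf("RLIMIT_MEMLOCK: %llu -> %llu bytes\n",
               (unsigned long long)old_rlim.rlim_cur,
               (unsigned long long)new_rlim.rlim_cur);
        return 0;
}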
> > > > >
> > > > > I have an impression that iommufd and vfio type1 must use
> > > > > the same accounting scheme, given the management stack
> > > > > has no insight into qemu regarding which one is actually used
> > > > > and thus cannot adapt to the subtle difference in between. In
> > > > > this regard, either we start fixing vfio type1 to use
> > > > > user->locked_vm now, or we have iommufd follow vfio type1 for
> > > > > upward compatibility and then change them together at a later
> > > > > point.
> > > > >
> > > > > I prefer the former, as IMHO I don't know when that later point
> > > > > would come w/o certain kernel changes to actually break the
> > > > > userspace policy built on a wrong accounting scheme...
> > > >
> > > > I wonder if the kernel is the right place to do this. We have new uAPI
> > >
> > > I didn't get this. This thread is about the fact that VFIO uses a
> > > wrong accounting scheme, and the discussion is about the impact on
> > > userspace of fixing it.
> >
> > It's probably too late to fix the kernel considering it may break
> > existing userspace.
> >
> > > I didn't see the question on the "right place" part.
> >
> > I meant it would be easier to let userspace know the difference than
> > to try to fix or work around it in the kernel.
>
> Jason already plans to fix RDMA, which will also lead to user-visible
> impact when Qemu uses both VFIO and RDMA. Why is VFIO so special
> that it is left behind to carry the legacy misdesign?

It's simply because we don't want to break existing userspace. [1]

> > > >
> > > > so the management layer can know the difference in the accounting in
> > > > advance by
> > > >
> > > > -device vfio-pci,iommufd=on
> > > >
> > >
> > > I suppose iommufd will be used once Qemu supports it, as long as
> > > the compatibility open issues that Jason/Alex discussed in another
> > > thread are well addressed. It does not necessarily have to be a
> > > control knob exposed to the caller.
> >
> > It has a lot of implications if we do this; it means iommufd needs to
> > inherit all the userspace-noticeable behaviours as well as the "bugs"
> > of VFIO.
> >
> > We know it's easier to find a difference than to show there is none.
> >
>
> In the end vfio type1 will be replaced by the iommufd compat layer. With
> that goal in mind, iommufd has to inherit type1 behaviors.

So the compatibility should be provided by the compat layer instead of
the core iommufd.

And I wonder what happens if iommufd is used by other subsystems like
vDPA. Does it mean vDPA needs to inherit type1 behaviours? If yes, do
we need a per-subsystem new uAPI to expose this capability? And if yes,
why can't VFIO have such an API, so that we don't even need the compat
layer at all?

Thanks

[1] https://lkml.org/lkml/2012/12/23/75

>
> Thanks
> Kevin
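P.S. To picture why charging user->locked_vm is cross-subsystem additive
while type1's per-mm counter is not, here is a small userspace sketch of
the cmpxchg charging loop quoted from iommufd above. The two pretend
"subsystems", the limit, and the page counts are all made up for
illustration.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for user_struct->locked_vm: one counter shared by every
 * subsystem that pins pages on behalf of this user. */
static atomic_long locked_vm;

/* Same pattern as the quoted iommufd code: charge npages against the
 * shared counter with a compare-and-swap retry loop. */
static bool charge(long npages, long lock_limit)
{
        long cur_pages, new_pages;

        do {
                cur_pages = atomic_load(&locked_vm);
                new_pages = cur_pages + npages;
                if (new_pages > lock_limit)
                        return false;   /* -ENOMEM in the kernel */
        } while (!atomic_compare_exchange_weak(&locked_vm, &cur_pages,
                                               new_pages));
        return true;
}

int main(void)
{
        long limit = 1024;      /* RLIMIT_MEMLOCK >> PAGE_SHIFT, say */

        /* Two subsystems pinning for the same user: the second charge
         * sees the first, unlike type1's private per-mm accounting,
         * where each would be checked against the limit in isolation. */
        printf("vfio-like pin of 800 pages: %s\n",
               charge(800, limit) ? "ok" : "ENOMEM");
        printf("iouring-like pin of 400 pages: %s\n",
               charge(400, limit) ? "ok" : "ENOMEM");
        return 0;
}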