On Tue, 2022-03-22 at 11:57 -0300, Jason Gunthorpe wrote: > On Tue, Mar 22, 2022 at 03:28:22PM +0100, Niklas Schnelle wrote: > > On Fri, 2022-03-18 at 14:27 -0300, Jason Gunthorpe wrote: > > > Following the pattern of io_uring, perf, skb, and bpf iommfd will use > > iommufd ----^ > > > user->locked_vm for accounting pinned pages. Ensure the value is included > > > in the struct and export free_uid() as iommufd is modular. > > > > > > user->locked_vm is the correct accounting to use for ulimit because it is > > > per-user, and the ulimit is not supposed to be per-process. Other > > > places (vfio, vdpa and infiniband) have used mm->pinned_vm and/or > > > mm->locked_vm for accounting pinned pages, but this is only per-process > > > and inconsistent with the majority of the kernel. > > > > Since this will replace parts of vfio this difference seems > > significant. Can you explain this a bit more? > > I'm not sure what to say more, this is the correct way to account for > this. It is natural to see it is right because the ulimit is supposted > to be global to the user, not effectively reset every time the user > creates a new process. > > So checking the ulimit against a per-process variable in the mm_struct > doesn't make too much sense. Yes I agree that logically this makes more sense. I was kind of aiming in the same direction as Alex i.e. it's a conscious decision to do it right and we need to know where this may lead to differences and how to handle them. > > > I'm also a bit confused how io_uring handles this. When I stumbled over > > the problem fixed by 6b7898eb180d ("io_uring: fix imbalanced sqo_mm > > accounting") and from that commit description I seem to rember that > > io_uring also accounts in mm->locked_vm too? > > locked/pinned_pages in the mm is kind of a debugging counter, it > indicates how many pins the user obtained through this mm. AFAICT its > only correct use is to report through proc. Things are supposed to > update it, but there is no reason to limit it as the user limit > supersedes it. > > The commit you pointed at is fixing that io_uring corrupted the value. > > Since VFIO checks locked/pinned_pages against the ulimit would blow up > when the value was wrong. > > > In fact I stumbled over that because the wrong accounting in > > io_uring exhausted the applied to vfio (I was using a QEMU utilizing > > io_uring itself). > > I'm pretty interested in this as well, do you have anything you can > share? This was the issue reported in the following BZ. https://bugzilla.kernel.org/show_bug.cgi?id=209025 I stumbled over the same problem on my x86 box and also on s390. I don't remember exactly what limit this ran into but I suspect it had something to do with the libvirt resource limits Alex mentioned. Meaning io_uring had an accounting bug and then vfio / QEMU couldn't pin memory. I think that libvirt limit is set to allow pinning all of guest memory plus a bit so the io_uring misaccounting easily tipped it over. > > Thanks, > Jason