On 6/28/2022 9:03 AM, Jason Gunthorpe wrote:
> On Tue, Jun 28, 2022 at 08:48:11AM -0400, Steven Sistare wrote:
>> For cpr, old qemu directly exec's new qemu, so task does not change.
>>
>> To support fork+exec, the ownership test needs to be deleted or modified.
>>
>> Pinned page accounting is another issue, as the parent counts pins in its
>> mm->locked_vm.  If the child unmaps, it cannot simply decrement its own
>> mm->locked_vm counter.
>
> It is fine already:
>
>         mm = async ? get_task_mm(dma->task) : dma->task->mm;
>         if (!mm)
>                 return -ESRCH; /* process exited */
>
>         ret = mmap_write_lock_killable(mm);
>         if (!ret) {
>                 ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
>                                           dma->lock_cap);
>
> Each 'dma' already stores a pointer to the mm that sourced it and only
> manipulates the counter in that mm. AFAICT 'current' is not used
> during unmap.

Ah yes, no problem then.  Limits become looser, though, as the child can
pin an additional RLIMIT_MEMLOCK of pages.  That is the natural consequence
of mm->locked_vm being a per process limit, but probably not what the
application wants.  Another argument for switching to user->locked_vm.

>> As you and I have discussed, the count is also wrong in the direct
>> exec model, because exec clears mm->locked_vm.
>
> Really? Yikes, I thought exec would generate a new mm?

Yes, exec creates a new mm with locked_vm = 0.  The old locked_vm count is
dropped on the floor.  The existing dma points to the same task, but task->mm
has changed, and dma->task->mm->locked_vm is 0.  An unmap ioctl drives it
negative.

I have prototyped a few possible fixes.  One changes vfio to use
user->locked_vm.  Another changes to mm->pinned_vm and preserves it during
exec.  A third preserves mm->locked_vm across exec, but that is not practical,
because mm->locked_vm mixes vfio pins and mlocks.  The mlock component must be
cleared during exec, and we don't have a separate count for it.
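
To make the exec case above concrete, here is a toy user-space model of the
sequence (pin, exec, unmap).  It is only a sketch, not kernel code: the toy_*
structs and functions are invented for illustration and stand in loosely for
the dma->task->mm->locked_vm chain.  The toy uses a signed long so the
underflow shows up as a negative value.

```c
/* Toy user-space model of the accounting problem; all toy_* names are
 * hypothetical and do not exist in the kernel. */
#include <assert.h>
#include <stdlib.h>

struct toy_mm   { long locked_vm; };                    /* per-mm pin counter */
struct toy_task { struct toy_mm *mm; };                 /* task points at its current mm */
struct toy_dma  { struct toy_task *task; long npage; }; /* mapping remembers the task */

/* Charge (npage > 0) or uncharge (npage < 0) against whatever mm the
 * dma's task currently has. */
void toy_account(struct toy_dma *dma, long npage)
{
    assert(dma->task->mm != NULL);
    dma->task->mm->locked_vm += npage;
}

/* Model exec: same task, but a fresh mm whose locked_vm starts at 0.
 * The old count is simply discarded along with the old mm. */
void toy_exec(struct toy_task *task)
{
    struct toy_mm *new_mm = calloc(1, sizeof(*new_mm));

    assert(new_mm != NULL);
    free(task->mm);
    task->mm = new_mm;
}

/* Pin npage pages, exec, then unmap them.  Returns the counter left in
 * the new mm. */
long toy_pin_exec_unmap(long npage)
{
    struct toy_task task = { .mm = calloc(1, sizeof(struct toy_mm)) };
    struct toy_dma dma = { .task = &task, .npage = npage };
    long result;

    assert(task.mm != NULL);
    toy_account(&dma, dma.npage);   /* map: charged to the old mm */
    toy_exec(&task);                /* exec: old count is lost */
    toy_account(&dma, -dma.npage);  /* unmap: uncharged from the new mm */

    result = task.mm->locked_vm;
    free(task.mm);
    return result;
}
```

Calling toy_pin_exec_unmap(512) leaves the new mm's counter at -512, which is
the "unmap drives it negative" case.  Any of the fixes listed above would
amount to keeping the count somewhere that survives toy_exec().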

>> I am thinking vfio could count pins in struct user locked_vm to handle both
>> models.  The user struct and its count would persist across direct exec,
>> and be shared by parent and child for fork+exec.  However, that does change
>> the RLIMIT_MEMLOCK value that applications must set, because the limit must
>> accommodate vfio plus other sub-systems that count in user->locked_vm, which
>> includes io_uring, skbuff, xdp, and perf.  Plus, the limit must accommodate all
>> processes of that user, not just a single process.
>
> We discussed this, for iommufd we are currently planning to go this
> way and will See How it Goes.

Yes, I have followed that thread with interest.

- Steve