On 12/14/2022 11:52 PM, Tian, Kevin wrote: >> From: Steve Sistare <steven.sistare@xxxxxxxxxx> >> Sent: Thursday, December 15, 2022 5:25 AM >> >> When a vfio container is preserved across exec using the >> VFIO_UPDATE_VADDR >> interfaces, locked_vm of the new mm becomes 0. If the user later unmaps a >> dma mapping, locked_vm underflows to a large unsigned value, and a >> subsequent dma map request fails with ENOMEM in __account_locked_vm. > > Isn't it the problem more than underflow? w/o copying locked_vm to the > new mm it means rlimit of the new mm could be exceeded since only > newly pinned pages after exec are counted. Yes. I focused on underflow, because that will even cause an app that has no limit to fail, but I will add rlimit to the commit message. >> @@ -424,6 +425,10 @@ static int vfio_lock_acct(struct vfio_dma *dma, long >> npage, bool async) >> if (!mm) >> return -ESRCH; /* process exited */ >> >> + /* Avoid locked_vm underflow */ >> + if (dma->mm != mm && npage < 0) >> + goto out; >> + > > the comment is unclear why the condition will lead to underflow. > > It's also unclear why the guard is only for exec case but not fork-exec. > According to this patch locked_vm was not copied to the new mm in > both cases before this fix. Correct. > Then why would underflow only happen > in one but not in the other? Underflow does not occur for fork-exec. I was wrong when I said so in email. For fork-exec, if the original process has exited, then we return ESRCH. If the original process is alive, then dma->mm == mm, and we proceed to decrement its locked_vm. > Anyway more explanation is appreciated here. How about: /* If task exec'd, the old mm is dead. Avoid locked_mm underflow. */ >> +static int vfio_change_dma_owner(struct vfio_dma *dma) >> +{ >> + int ret = 0; >> + struct mm_struct *mm = get_task_mm(dma->task); > > should this be current->group_leader? otherwise in fork-exec case > dma->mm is still equal to dma->task->mm. Thanks, that is a bug. The code works if the original process exited, but not if it persists. The test should be: if (dma->mm != mm || new_task != dma->task) { >> + >> + if (dma->mm != mm) { >> + long npage = dma->size >> PAGE_SHIFT; >> + bool new_lock_cap = capable(CAP_IPC_LOCK); >> + struct task_struct *new_task = current->group_leader; >> + >> + ret = mmap_write_lock_killable(new_task->mm); >> + if (ret) >> + goto out; >> + >> + ret = __account_locked_vm(new_task->mm, npage, true, >> + new_task, new_lock_cap); >> + mmap_write_unlock(new_task->mm); >> + if (ret) >> + goto out; >> + >> + if (dma->task != new_task) { >> + vfio_lock_acct(dma, -npage, 0); > > Presumably this should be always done on the old mm, even in exec case. No, for exec, the old mm is a zombie. - Steve