On 12/13/2022 2:29 PM, Alex Williamson wrote: > On Tue, 13 Dec 2022 13:21:15 -0500 > Steven Sistare <steven.sistare@xxxxxxxxxx> wrote: > >> On 12/13/2022 1:17 PM, Steven Sistare wrote: >>> On 12/13/2022 1:02 PM, Alex Williamson wrote: >>>> On Tue, 13 Dec 2022 07:46:56 -0800 >>>> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote: >>>> >>>>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR >>>>> interfaces, locked_vm of the new mm becomes 0. If the user later unmaps a >>>>> dma mapping, locked_vm underflows to a large unsigned value, and a >>>>> subsequent dma map request fails with ENOMEM in __account_locked_vm. >>>>> >>>>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed, >>>>> add the mapping's pinned page count to the new mm->locked_vm, subject to >>>>> the rlimit. Now that mediated devices are excluded when using >>>>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the >>>>> mapping. >>>>> >>>>> Underflow will not occur when all dma mappings are invalidated before exec. >>>>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR >>>>> will fail with EINVAL because the mapping is in the vaddr_invalid state. >>>> >>>> Where is this enforced? >>> >>> In vfio_dma_do_unmap: >>> if (invalidate_vaddr) { >>> if (dma->vaddr_invalid) { >>> ... >>> ret = -EINVAL; >> >> My bad, this is a different case, and my comment in the commit message is >> incorrect. I should test mm != dma->mm during unmap as well, and suppress >> the locked_vm deduction there. > > I'm getting confused how this patch actually does anything. We grab > the mm of the task doing mappings, and we swap that grab when updating > the vaddr, but vfio_lock_acct() uses the original dma->task mm for > accounting. Therefore how can an underflow occur? It seems we're > simply failing to adjust locked_vm for the new mm at all. The old code saves dma->task, but not dma->task->mm. The task's mm changes across exec. >>>>> Underflow may still occur in a buggy application that fails to invalidate >>>>> all before exec. >>>>> >>>>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx> >>>>> --- >>>>> drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++ >>>>> 1 file changed, 11 insertions(+) >>>>> >>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c >>>>> index f81e925..e5a02f8 100644 >>>>> --- a/drivers/vfio/vfio_iommu_type1.c >>>>> +++ b/drivers/vfio/vfio_iommu_type1.c >>>>> @@ -100,6 +100,7 @@ struct vfio_dma { >>>>> struct task_struct *task; >>>>> struct rb_root pfn_list; /* Ex-user pinned pfn list */ >>>>> unsigned long *bitmap; >>>>> + struct mm_struct *mm; >>>>> }; >>>>> >>>>> struct vfio_batch { >>>>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) >>>>> vfio_unmap_unpin(iommu, dma, true); >>>>> vfio_unlink_dma(iommu, dma); >>>>> put_task_struct(dma->task); >>>>> + mmdrop(dma->mm); >>>>> vfio_dma_bitmap_free(dma); >>>>> if (dma->vaddr_invalid) { >>>>> iommu->vaddr_invalid_count--; >>>>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, >>>>> dma->vaddr = vaddr; >>>>> dma->vaddr_invalid = false; >>>>> iommu->vaddr_invalid_count--; >>>>> + if (current->mm != dma->mm) { >>>>> + mmdrop(dma->mm); >>>>> + dma->mm = current->mm; >>>>> + mmgrab(dma->mm); >>>>> + ret = vfio_lock_acct(dma, size >> PAGE_SHIFT, >>>>> + 0); >>>> >>>> What does it actually mean if this fails? The pages are still pinned. >>>> lock_vm doesn't get updated. Underflow can still occur. Thanks, >>> >>> If this fails, the user has locked additional memory after exec and before making >>> this call -- more than was locked before exec -- and the rlimit is exceeded. >>> A misbehaving application, which will only hurt itself. >>> >>> However, I should reorder these, and check ret before changing the other state. > > The result would then be that the mapping remains with vaddr_invalid on > error? Thanks, Correct. In theory the app could recover by releasing the extra locked memory that it grabbed, or increase its rlimit, and then try map_flag_vaddr again. - Steve >>>>> + } >>>>> wake_up_all(&iommu->vaddr_wait); >>>>> } >>>>> goto out_unlock; >>>>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, >>>>> get_task_struct(current->group_leader); >>>>> dma->task = current->group_leader; >>>>> dma->lock_cap = capable(CAP_IPC_LOCK); >>>>> + dma->mm = dma->task->mm; >>>>> + mmgrab(dma->mm); >>>>> >>>>> dma->pfn_list = RB_ROOT; >>>>> >>>> >> >