On 12/13/2022 1:17 PM, Steven Sistare wrote: > On 12/13/2022 1:02 PM, Alex Williamson wrote: >> On Tue, 13 Dec 2022 07:46:56 -0800 >> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote: >> >>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR >>> interfaces, locked_vm of the new mm becomes 0. If the user later unmaps a >>> dma mapping, locked_vm underflows to a large unsigned value, and a >>> subsequent dma map request fails with ENOMEM in __account_locked_vm. >>> >>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed, >>> add the mapping's pinned page count to the new mm->locked_vm, subject to >>> the rlimit. Now that mediated devices are excluded when using >>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the >>> mapping. >>> >>> Underflow will not occur when all dma mappings are invalidated before exec. >>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR >>> will fail with EINVAL because the mapping is in the vaddr_invalid state. >> >> Where is this enforced? > > In vfio_dma_do_unmap: > if (invalidate_vaddr) { > if (dma->vaddr_invalid) { > ... > ret = -EINVAL; My bad, this is a different case, and my comment in the commit message is incorrect. I should test mm != dma->mm during unmap as well, and suppress the locked_vm deduction there. - Steve >>> Underflow may still occur in a buggy application that fails to invalidate >>> all before exec. >>> >>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx> >>> --- >>> drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++ >>> 1 file changed, 11 insertions(+) >>> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c >>> index f81e925..e5a02f8 100644 >>> --- a/drivers/vfio/vfio_iommu_type1.c >>> +++ b/drivers/vfio/vfio_iommu_type1.c >>> @@ -100,6 +100,7 @@ struct vfio_dma { >>> struct task_struct *task; >>> struct rb_root pfn_list; /* Ex-user pinned pfn list */ >>> unsigned long *bitmap; >>> + struct mm_struct *mm; >>> }; >>> >>> struct vfio_batch { >>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) >>> vfio_unmap_unpin(iommu, dma, true); >>> vfio_unlink_dma(iommu, dma); >>> put_task_struct(dma->task); >>> + mmdrop(dma->mm); >>> vfio_dma_bitmap_free(dma); >>> if (dma->vaddr_invalid) { >>> iommu->vaddr_invalid_count--; >>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, >>> dma->vaddr = vaddr; >>> dma->vaddr_invalid = false; >>> iommu->vaddr_invalid_count--; >>> + if (current->mm != dma->mm) { >>> + mmdrop(dma->mm); >>> + dma->mm = current->mm; >>> + mmgrab(dma->mm); >>> + ret = vfio_lock_acct(dma, size >> PAGE_SHIFT, >>> + 0); >> >> What does it actually mean if this fails? The pages are still pinned. >> lock_vm doesn't get updated. Underflow can still occur. Thanks, > > If this fails, the user has locked additional memory after exec and before making > this call -- more than was locked before exec -- and the rlimit is exceeded. > A misbehaving application, which will only hurt itself. > > However, I should reorder these, and check ret before changing the other state. > > - Steve > >>> + } >>> wake_up_all(&iommu->vaddr_wait); >>> } >>> goto out_unlock; >>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, >>> get_task_struct(current->group_leader); >>> dma->task = current->group_leader; >>> dma->lock_cap = capable(CAP_IPC_LOCK); >>> + dma->mm = dma->task->mm; >>> + mmgrab(dma->mm); >>> >>> dma->pfn_list = RB_ROOT; >>> >>