On Tue, 13 Dec 2022 13:21:15 -0500 Steven Sistare <steven.sistare@xxxxxxxxxx> wrote: > On 12/13/2022 1:17 PM, Steven Sistare wrote: > > On 12/13/2022 1:02 PM, Alex Williamson wrote: > >> On Tue, 13 Dec 2022 07:46:56 -0800 > >> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote: > >> > >>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR > >>> interfaces, locked_vm of the new mm becomes 0. If the user later unmaps a > >>> dma mapping, locked_vm underflows to a large unsigned value, and a > >>> subsequent dma map request fails with ENOMEM in __account_locked_vm. > >>> > >>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed, > >>> add the mapping's pinned page count to the new mm->locked_vm, subject to > >>> the rlimit. Now that mediated devices are excluded when using > >>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the > >>> mapping. > >>> > >>> Underflow will not occur when all dma mappings are invalidated before exec. > >>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR > >>> will fail with EINVAL because the mapping is in the vaddr_invalid state. > >> > >> Where is this enforced? > > > > In vfio_dma_do_unmap: > > if (invalidate_vaddr) { > > if (dma->vaddr_invalid) { > > ... > > ret = -EINVAL; > > My bad, this is a different case, and my comment in the commit message is > incorrect. I should test mm != dma->mm during unmap as well, and suppress > the locked_vm deduction there. I'm getting confused how this patch actually does anything. We grab the mm of the task doing mappings, and we swap that grab when updating the vaddr, but vfio_lock_acct() uses the original dma->task mm for accounting. Therefore how can an underflow occur? It seems we're simply failing to adjust locked_vm for the new mm at all. > >>> Underflow may still occur in a buggy application that fails to invalidate > >>> all before exec. > >>> > >>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx> > >>> --- > >>> drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++ > >>> 1 file changed, 11 insertions(+) > >>> > >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > >>> index f81e925..e5a02f8 100644 > >>> --- a/drivers/vfio/vfio_iommu_type1.c > >>> +++ b/drivers/vfio/vfio_iommu_type1.c > >>> @@ -100,6 +100,7 @@ struct vfio_dma { > >>> struct task_struct *task; > >>> struct rb_root pfn_list; /* Ex-user pinned pfn list */ > >>> unsigned long *bitmap; > >>> + struct mm_struct *mm; > >>> }; > >>> > >>> struct vfio_batch { > >>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma) > >>> vfio_unmap_unpin(iommu, dma, true); > >>> vfio_unlink_dma(iommu, dma); > >>> put_task_struct(dma->task); > >>> + mmdrop(dma->mm); > >>> vfio_dma_bitmap_free(dma); > >>> if (dma->vaddr_invalid) { > >>> iommu->vaddr_invalid_count--; > >>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, > >>> dma->vaddr = vaddr; > >>> dma->vaddr_invalid = false; > >>> iommu->vaddr_invalid_count--; > >>> + if (current->mm != dma->mm) { > >>> + mmdrop(dma->mm); > >>> + dma->mm = current->mm; > >>> + mmgrab(dma->mm); > >>> + ret = vfio_lock_acct(dma, size >> PAGE_SHIFT, > >>> + 0); > >> > >> What does it actually mean if this fails? The pages are still pinned. > >> lock_vm doesn't get updated. Underflow can still occur. Thanks, > > > > If this fails, the user has locked additional memory after exec and before making > > this call -- more than was locked before exec -- and the rlimit is exceeded. > > A misbehaving application, which will only hurt itself. > > > > However, I should reorder these, and check ret before changing the other state. The result would then be that the mapping remains with vaddr_invalid on error? Thanks, Alex > >>> + } > >>> wake_up_all(&iommu->vaddr_wait); > >>> } > >>> goto out_unlock; > >>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, > >>> get_task_struct(current->group_leader); > >>> dma->task = current->group_leader; > >>> dma->lock_cap = capable(CAP_IPC_LOCK); > >>> + dma->mm = dma->task->mm; > >>> + mmgrab(dma->mm); > >>> > >>> dma->pfn_list = RB_ROOT; > >>> > >> >