Re: [PATCH V1 2/2] vfio/type1: prevent locked_vm underflow

Steven Sistare <steven.sistare@xxxxxxxxxx> · Tue, 13 Dec 2022 13:21:15 -0500

On 12/13/2022 1:17 PM, Steven Sistare wrote:
> On 12/13/2022 1:02 PM, Alex Williamson wrote:
>> On Tue, 13 Dec 2022 07:46:56 -0800
>> Steve Sistare <steven.sistare@xxxxxxxxxx> wrote:
>>
>>> When a vfio container is preserved across exec using the VFIO_UPDATE_VADDR
>>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
>>> dma mapping, locked_vm underflows to a large unsigned value, and a
>>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
>>>
>>> To fix, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed,
>>> add the mapping's pinned page count to the new mm->locked_vm, subject to
>>> the rlimit.  Now that mediated devices are excluded when using
>>> VFIO_UPDATE_VADDR, the amount of pinned memory equals the size of the
>>> mapping.
>>>
>>> Underflow will not occur when all dma mappings are invalidated before exec.
>>> An attempt to unmap before updating the vaddr with VFIO_DMA_MAP_FLAG_VADDR
>>> will fail with EINVAL because the mapping is in the vaddr_invalid state.
>>
>> Where is this enforced?
> 
> In vfio_dma_do_unmap:
>         if (invalidate_vaddr) {
>                 if (dma->vaddr_invalid) {
>                         ...
>                         ret = -EINVAL;

My bad, this is a different case, and my comment in the commit message is
incorrect.  I should test mm != dma->mm during unmap as well, and suppress
the locked_vm deduction there.

- Steve

>>> Underflow may still occur in a buggy application that fails to invalidate
>>> all before exec.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 11 +++++++++++
>>>  1 file changed, 11 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index f81e925..e5a02f8 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -100,6 +100,7 @@ struct vfio_dma {
>>>  	struct task_struct	*task;
>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>  	unsigned long		*bitmap;
>>> +	struct mm_struct	*mm;
>>>  };
>>>  
>>>  struct vfio_batch {
>>> @@ -1174,6 +1175,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>>  	vfio_unmap_unpin(iommu, dma, true);
>>>  	vfio_unlink_dma(iommu, dma);
>>>  	put_task_struct(dma->task);
>>> +	mmdrop(dma->mm);
>>>  	vfio_dma_bitmap_free(dma);
>>>  	if (dma->vaddr_invalid) {
>>>  		iommu->vaddr_invalid_count--;
>>> @@ -1622,6 +1624,13 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>  			dma->vaddr = vaddr;
>>>  			dma->vaddr_invalid = false;
>>>  			iommu->vaddr_invalid_count--;
>>> +			if (current->mm != dma->mm) {
>>> +				mmdrop(dma->mm);
>>> +				dma->mm = current->mm;
>>> +				mmgrab(dma->mm);
>>> +				ret = vfio_lock_acct(dma, size >> PAGE_SHIFT,
>>> +						     0);
>>
>> What does it actually mean if this fails?  The pages are still pinned.
>> lock_vm doesn't get updated.  Underflow can still occur.  Thanks,
> 
> If this fails, the user has locked additional memory after exec and before making
> this call -- more than was locked before exec -- and the rlimit is exceeded.
> A misbehaving application, which will only hurt itself.
> 
> However, I should reorder these, and check ret before changing the other state.
> 
> - Steve
> 
>>> +			}
>>>  			wake_up_all(&iommu->vaddr_wait);
>>>  		}
>>>  		goto out_unlock;
>>> @@ -1679,6 +1688,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>  	get_task_struct(current->group_leader);
>>>  	dma->task = current->group_leader;
>>>  	dma->lock_cap = capable(CAP_IPC_LOCK);
>>> +	dma->mm = dma->task->mm;
>>> +	mmgrab(dma->mm);
>>>  
>>>  	dma->pfn_list = RB_ROOT;
>>>  
>>