Re: [PATCH V4 2/5] vfio/type1: prevent locked_vm underflow

Steven Sistare <steven.sistare@xxxxxxxxxx> · Thu, 15 Dec 2022 09:33:32 -0500

On 12/14/2022 11:52 PM, Tian, Kevin wrote:
>> From: Steve Sistare <steven.sistare@xxxxxxxxxx>
>> Sent: Thursday, December 15, 2022 5:25 AM
>>
>> When a vfio container is preserved across exec using the
>> VFIO_UPDATE_VADDR
>> interfaces, locked_vm of the new mm becomes 0.  If the user later unmaps a
>> dma mapping, locked_vm underflows to a large unsigned value, and a
>> subsequent dma map request fails with ENOMEM in __account_locked_vm.
> 
> Isn't it the problem more than underflow? w/o copying locked_vm to the
> new mm it means rlimit of the new mm could be exceeded since only
> newly pinned pages after exec are counted.

Yes.  I focused on underflow, because that will even cause an app that has
no limit to fail, but I will add rlimit to the commit message.

>> @@ -424,6 +425,10 @@ static int vfio_lock_acct(struct vfio_dma *dma, long
>> npage, bool async)
>>  	if (!mm)
>>  		return -ESRCH; /* process exited */
>>
>> +	/* Avoid locked_vm underflow */
>> +	if (dma->mm != mm && npage < 0)
>> +		goto out;
>> +
> 
> the comment is unclear why the condition will lead to underflow.
> 
> It's also unclear why the guard is only for exec case but not fork-exec.
> According to this patch locked_vm was not copied to the new mm in
> both cases before this fix. 

Correct.

> Then why would underflow only happen
> in one but not in the other?

Underflow does not occur for fork-exec.  I was wrong when I said so in email.

For fork-exec, if the original process has exited, then we return ESRCH.
If the original process is alive, then dma->mm == mm, and we proceed to 
decrement its locked_vm.

> Anyway more explanation is appreciated here.

How about:
   /* If task exec'd, the old mm is dead.  Avoid locked_mm underflow. */

>> +static int vfio_change_dma_owner(struct vfio_dma *dma)
>> +{
>> +	int ret = 0;
>> +	struct mm_struct *mm = get_task_mm(dma->task);
> 
> should this be current->group_leader? otherwise in fork-exec case
> dma->mm is still equal to dma->task->mm.

Thanks, that is a bug.  The code works if the original process exited, but
not if it persists.  The test should be:

        if (dma->mm != mm || new_task != dma->task) {

>> +
>> +	if (dma->mm != mm) {
>> +		long npage = dma->size >> PAGE_SHIFT;
>> +		bool new_lock_cap = capable(CAP_IPC_LOCK);
>> +		struct task_struct *new_task = current->group_leader;
>> +
>> +		ret = mmap_write_lock_killable(new_task->mm);
>> +		if (ret)
>> +			goto out;
>> +
>> +		ret = __account_locked_vm(new_task->mm, npage, true,
>> +					  new_task, new_lock_cap);
>> +		mmap_write_unlock(new_task->mm);
>> +		if (ret)
>> +			goto out;
>> +
>> +		if (dma->task != new_task) {
>> +			vfio_lock_acct(dma, -npage, 0);
> 
> Presumably this should be always done on the old mm, even in exec case.

No, for exec, the old mm is a zombie.  

- Steve