RE: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak

"Yu, Lang" <Lang.Yu@xxxxxxx> · Thu, 30 Sep 2021 02:23:17 +0000

>-----Original Message-----
>From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>Sent: Thursday, September 30, 2021 9:47 AM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
><Ray.Huang@xxxxxxx>
>Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak
>
>On 2021-09-29 7:32 p.m., Yu, Lang wrote:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>>> Sent: Wednesday, September 29, 2021 11:25 PM
>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray
>>> <Ray.Huang@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak
>>>
>>> Am 2021-09-29 um 4:22 a.m. schrieb Lang Yu:
>>>> If user doesn't explicitly call kfd_ioctl_destroy_queue to destroy
>>>> all created queues, when the kfd process is destroyed, some queues'
>>>> cu_mask memory are not freed.
>>>>
>>>> To avoid forgetting to free them in some places, free them
>>>> immediately after use.
>>>>
>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c               |  8 ++++----
>>>>   drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10
>>>> ++++------
>>>>   2 files changed, 8 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> index 4de907f3e66a..5c0e6dcf692a 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> @@ -451,8 +451,8 @@ static int kfd_ioctl_set_cu_mask(struct file
>>>> *filp, struct
>>> kfd_process *p,
>>>>   	retval = copy_from_user(properties.cu_mask, cu_mask_ptr,
>>> cu_mask_size);
>>>>   	if (retval) {
>>>>   		pr_debug("Could not copy CU mask from userspace");
>>>> -		kfree(properties.cu_mask);
>>>> -		return -EFAULT;
>>>> +		retval = -EFAULT;
>>>> +		goto out;
>>>>   	}
>>>>
>>>>   	mutex_lock(&p->mutex);
>>>> @@ -461,8 +461,8 @@ static int kfd_ioctl_set_cu_mask(struct file
>>>> *filp, struct kfd_process *p,
>>>>
>>>>   	mutex_unlock(&p->mutex);
>>>>
>>>> -	if (retval)
>>>> -		kfree(properties.cu_mask);
>>>> +out:
>>>> +	kfree(properties.cu_mask);
>>>>
>>>>   	return retval;
>>>>   }
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>> index 243dd1efcdbf..4c81d690f31a 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
>>>> @@ -394,8 +394,6 @@ int pqm_destroy_queue(struct
>>> process_queue_manager *pqm, unsigned int qid)
>>>>   			pdd->qpd.num_gws = 0;
>>>>   		}
>>>>
>>>> -		kfree(pqn->q->properties.cu_mask);
>>>> -		pqn->q->properties.cu_mask = NULL;
>>>>   		uninit_queue(pqn->q);
>>>>   	}
>>>>
>>>> @@ -448,16 +446,16 @@ int pqm_set_cu_mask(struct
>>> process_queue_manager *pqm, unsigned int qid,
>>>>   		return -EFAULT;
>>>>   	}
>>>>
>>>> -	/* Free the old CU mask memory if it is already allocated, then
>>>> -	 * allocate memory for the new CU mask.
>>>> -	 */
>>>> -	kfree(pqn->q->properties.cu_mask);
>>>> +	WARN_ON_ONCE(pqn->q->properties.cu_mask);
>>>>
>>>>   	pqn->q->properties.cu_mask_count = p->cu_mask_count;
>>>>   	pqn->q->properties.cu_mask = p->cu_mask;
>>>>
>>>>   	retval = pqn->q->device->dqm->ops.update_queue(pqn->q->device->dqm,
>>>>   							pqn->q);
>>>> +
>>>> +	pqn->q->properties.cu_mask = NULL;
>>>> +
>>> This won't work correctly. We need to save the cu_mask for later.
>>> Otherwise the next time dqm->ops.update_queue is called, for example
>>> in pqm_update_queue or pqm_set_gws, it will wipe out the CU mask in the
>MQD.
>> Let's just return when meeting a null cu_mask in update_cu_mask() to avoid
>that.
>> Like following,
>>
>> static void update_cu_mask(struct mqd_manager *mm, void *mqd,
>> 			   struct queue_properties *q)
>> {
>> 	struct v10_compute_mqd *m;
>> 	uint32_t se_mask[4] = {0}; /* 4 is the max # of SEs */
>>
>> 	if (!q-> cu_mask || q->cu_mask_count == 0)
>> 		return;
>> 	......
>> }
>>
>> Is this fine with you? Thanks!
>
>I think that could work. I still don't like it. It leaves the CU mask in the q-
>>properties structure, but it's only ever used temporarily and doesn't need to be
>persistent. I'd argue, in this case, the cu_mask shouldn't be in the q->properties
>structure at all, but should be passed as an optional parameter into the dqm-
>>ops.update_queue call.

The cu_mask is originally in q->properties structure. I didn't change that.
What I want to do is keeping the cu_mask memory allocation and deallocation just in kfd_ioctl_set_cu_mask.
instead of everywhere.

>But I think a simpler fix would be to move the freeing of the CU mask into
>uninit_queue. That would catch all cases where a queue gets destroyed, including
>the process termination case.

Yes, it' better to free queue related resource in one function.
But we allocate it in kfd_ioctl_set_cu_mask and free it in pqm_set_cu_mask, uninit_queue and so on.
It seems strange and error-prone. Unless we allocate it in create queue.  As you said, it is temporary.
It's not worth keeping it until queue is destroyed. Thanks.

Regards,
Lang

>Regards,
>   Felix
>
>
>>
>> Regards,
>> Lang
>>
>>> Regards,
>>>    Felix
>>>
>>>
>>>>   	if (retval != 0)
>>>>   		return retval;
>>>>