>-----Original Message----- >From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >Sent: Thursday, September 30, 2021 10:28 AM >To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray ><Ray.Huang@xxxxxxx> >Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak > >On 2021-09-29 10:23 p.m., Yu, Lang wrote: >>> -----Original Message----- >>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >>> Sent: Thursday, September 30, 2021 9:47 AM >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >>> <Ray.Huang@xxxxxxx> >>> Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak >>> >>> On 2021-09-29 7:32 p.m., Yu, Lang wrote: >>>> [AMD Official Use Only] >>>> >>>> >>>> >>>>> -----Original Message----- >>>>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >>>>> Sent: Wednesday, September 29, 2021 11:25 PM >>>>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >>>>> <Ray.Huang@xxxxxxx> >>>>> Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory >>>>> leak >>>>> >>>>> Am 2021-09-29 um 4:22 a.m. schrieb Lang Yu: >>>>>> If user doesn't explicitly call kfd_ioctl_destroy_queue to destroy >>>>>> all created queues, when the kfd process is destroyed, some queues' >>>>>> cu_mask memory are not freed. >>>>>> >>>>>> To avoid forgetting to free them in some places, free them >>>>>> immediately after use. >>>>>> >>>>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 ++++---- >>>>>> drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10 >>>>>> ++++------ >>>>>> 2 files changed, 8 insertions(+), 10 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>>>> index 4de907f3e66a..5c0e6dcf692a 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>>>> @@ -451,8 +451,8 @@ static int kfd_ioctl_set_cu_mask(struct file >>>>>> *filp, struct >>>>> kfd_process *p, >>>>>> retval = copy_from_user(properties.cu_mask, cu_mask_ptr, >>>>> cu_mask_size); >>>>>> if (retval) { >>>>>> pr_debug("Could not copy CU mask from userspace"); >>>>>> - kfree(properties.cu_mask); >>>>>> - return -EFAULT; >>>>>> + retval = -EFAULT; >>>>>> + goto out; >>>>>> } >>>>>> >>>>>> mutex_lock(&p->mutex); >>>>>> @@ -461,8 +461,8 @@ static int kfd_ioctl_set_cu_mask(struct file >>>>>> *filp, struct kfd_process *p, >>>>>> >>>>>> mutex_unlock(&p->mutex); >>>>>> >>>>>> - if (retval) >>>>>> - kfree(properties.cu_mask); >>>>>> +out: >>>>>> + kfree(properties.cu_mask); >>>>>> >>>>>> return retval; >>>>>> } >>>>>> diff --git >>>>>> a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>>>> index 243dd1efcdbf..4c81d690f31a 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>>>> @@ -394,8 +394,6 @@ int pqm_destroy_queue(struct >>>>> process_queue_manager *pqm, unsigned int qid) >>>>>> pdd->qpd.num_gws = 0; >>>>>> } >>>>>> >>>>>> - kfree(pqn->q->properties.cu_mask); >>>>>> - pqn->q->properties.cu_mask = NULL; >>>>>> uninit_queue(pqn->q); >>>>>> } >>>>>> >>>>>> @@ -448,16 +446,16 @@ int pqm_set_cu_mask(struct >>>>> process_queue_manager *pqm, unsigned int qid, >>>>>> return -EFAULT; >>>>>> } >>>>>> >>>>>> - /* Free the old CU mask memory if it is already allocated, then >>>>>> - * allocate memory for the new CU mask. >>>>>> - */ >>>>>> - kfree(pqn->q->properties.cu_mask); >>>>>> + WARN_ON_ONCE(pqn->q->properties.cu_mask); >>>>>> >>>>>> pqn->q->properties.cu_mask_count = p->cu_mask_count; >>>>>> pqn->q->properties.cu_mask = p->cu_mask; >>>>>> >>>>>> retval = pqn->q->device->dqm->ops.update_queue(pqn->q- >>device->dqm, >>>>>> pqn->q); >>>>>> + >>>>>> + pqn->q->properties.cu_mask = NULL; >>>>>> + >>>>> This won't work correctly. We need to save the cu_mask for later. >>>>> Otherwise the next time dqm->ops.update_queue is called, for >>>>> example in pqm_update_queue or pqm_set_gws, it will wipe out the CU >>>>> mask in the >>> MQD. >>>> Let's just return when meeting a null cu_mask in update_cu_mask() to >>>> avoid >>> that. >>>> Like following, >>>> >>>> static void update_cu_mask(struct mqd_manager *mm, void *mqd, >>>> struct queue_properties *q) >>>> { >>>> struct v10_compute_mqd *m; >>>> uint32_t se_mask[4] = {0}; /* 4 is the max # of SEs */ >>>> >>>> if (!q-> cu_mask || q->cu_mask_count == 0) >>>> return; >>>> ...... >>>> } >>>> >>>> Is this fine with you? Thanks! >>> I think that could work. I still don't like it. It leaves the CU mask >>> in the q- >>>> properties structure, but it's only ever used temporarily and >>>> doesn't need to be >>> persistent. I'd argue, in this case, the cu_mask shouldn't be in the >>> q->properties structure at all, but should be passed as an optional >>> parameter into the dqm- >>>> ops.update_queue call. >> The cu_mask is originally in q->properties structure. I didn't change that. >> What I want to do is keeping the cu_mask memory allocation and deallocation >just in kfd_ioctl_set_cu_mask. >> instead of everywhere. > >You're not changing where it is stored. But you're changing it from something >persistent (while the queue exists) to something ephemeral (while the ioctl call is >executing in the kernel). I think that would justify removing the persistent pointer >from the q->properties structure. Mmm, actually it has been copied to mqd memory. It doesn't make too much sense to keep it persistent. >> >>> But I think a simpler fix would be to move the freeing of the CU mask >>> into uninit_queue. That would catch all cases where a queue gets >>> destroyed, including the process termination case. >> Yes, it' better to free queue related resource in one function. >> But we allocate it in kfd_ioctl_set_cu_mask and free it in pqm_set_cu_mask, >uninit_queue and so on. > >My proposal here is to only free it in uninit_queue (when the queue is >destroyed) or in pqm_set_cu_mask (when it is replaced by a different cu_mask). Anyway, I still think it's strange and error-pone to allocate it in kfd_ioctl_set_cu_mask and free it in uninit_queue and pqm_set_cu_mask. Regards, Lang >Regards, > Felix > > >> It seems strange and error-prone. Unless we allocate it in create queue. As you >said, it is temporary. >> It's not worth keeping it until queue is destroyed. Thanks. >> >> Regards, >> Lang >> >>> Regards, >>> Felix >>> >>> >>>> Regards, >>>> Lang >>>> >>>>> Regards, >>>>> Felix >>>>> >>>>> >>>>>> if (retval != 0) >>>>>> return retval; >>>>>>