>-----Original Message----- >From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >Sent: Thursday, September 30, 2021 9:47 AM >To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray ><Ray.Huang@xxxxxxx> >Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak > >On 2021-09-29 7:32 p.m., Yu, Lang wrote: >> [AMD Official Use Only] >> >> >> >>> -----Original Message----- >>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >>> Sent: Wednesday, September 29, 2021 11:25 PM >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Ray >>> <Ray.Huang@xxxxxxx> >>> Subject: Re: [PATCH] drm/amdkfd: fix a potential cu_mask memory leak >>> >>> Am 2021-09-29 um 4:22 a.m. schrieb Lang Yu: >>>> If user doesn't explicitly call kfd_ioctl_destroy_queue to destroy >>>> all created queues, when the kfd process is destroyed, some queues' >>>> cu_mask memory are not freed. >>>> >>>> To avoid forgetting to free them in some places, free them >>>> immediately after use. >>>> >>>> Signed-off-by: Lang Yu <lang.yu@xxxxxxx> >>>> --- >>>> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 8 ++++---- >>>> drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 10 >>>> ++++------ >>>> 2 files changed, 8 insertions(+), 10 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>> index 4de907f3e66a..5c0e6dcf692a 100644 >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c >>>> @@ -451,8 +451,8 @@ static int kfd_ioctl_set_cu_mask(struct file >>>> *filp, struct >>> kfd_process *p, >>>> retval = copy_from_user(properties.cu_mask, cu_mask_ptr, >>> cu_mask_size); >>>> if (retval) { >>>> pr_debug("Could not copy CU mask from userspace"); >>>> - kfree(properties.cu_mask); >>>> - return -EFAULT; >>>> + retval = -EFAULT; >>>> + goto out; >>>> } >>>> >>>> mutex_lock(&p->mutex); >>>> @@ -461,8 +461,8 @@ static int kfd_ioctl_set_cu_mask(struct file >>>> *filp, struct kfd_process *p, >>>> >>>> mutex_unlock(&p->mutex); >>>> >>>> - if (retval) >>>> - kfree(properties.cu_mask); >>>> +out: >>>> + kfree(properties.cu_mask); >>>> >>>> return retval; >>>> } >>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>> b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>> index 243dd1efcdbf..4c81d690f31a 100644 >>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c >>>> @@ -394,8 +394,6 @@ int pqm_destroy_queue(struct >>> process_queue_manager *pqm, unsigned int qid) >>>> pdd->qpd.num_gws = 0; >>>> } >>>> >>>> - kfree(pqn->q->properties.cu_mask); >>>> - pqn->q->properties.cu_mask = NULL; >>>> uninit_queue(pqn->q); >>>> } >>>> >>>> @@ -448,16 +446,16 @@ int pqm_set_cu_mask(struct >>> process_queue_manager *pqm, unsigned int qid, >>>> return -EFAULT; >>>> } >>>> >>>> - /* Free the old CU mask memory if it is already allocated, then >>>> - * allocate memory for the new CU mask. >>>> - */ >>>> - kfree(pqn->q->properties.cu_mask); >>>> + WARN_ON_ONCE(pqn->q->properties.cu_mask); >>>> >>>> pqn->q->properties.cu_mask_count = p->cu_mask_count; >>>> pqn->q->properties.cu_mask = p->cu_mask; >>>> >>>> retval = pqn->q->device->dqm->ops.update_queue(pqn->q->device->dqm, >>>> pqn->q); >>>> + >>>> + pqn->q->properties.cu_mask = NULL; >>>> + >>> This won't work correctly. We need to save the cu_mask for later. >>> Otherwise the next time dqm->ops.update_queue is called, for example >>> in pqm_update_queue or pqm_set_gws, it will wipe out the CU mask in the >MQD. >> Let's just return when meeting a null cu_mask in update_cu_mask() to avoid >that. >> Like following, >> >> static void update_cu_mask(struct mqd_manager *mm, void *mqd, >> struct queue_properties *q) >> { >> struct v10_compute_mqd *m; >> uint32_t se_mask[4] = {0}; /* 4 is the max # of SEs */ >> >> if (!q-> cu_mask || q->cu_mask_count == 0) >> return; >> ...... >> } >> >> Is this fine with you? Thanks! > >I think that could work. I still don't like it. It leaves the CU mask in the q- >>properties structure, but it's only ever used temporarily and doesn't need to be >persistent. I'd argue, in this case, the cu_mask shouldn't be in the q->properties >structure at all, but should be passed as an optional parameter into the dqm- >>ops.update_queue call. The cu_mask is originally in q->properties structure. I didn't change that. What I want to do is keeping the cu_mask memory allocation and deallocation just in kfd_ioctl_set_cu_mask. instead of everywhere. >But I think a simpler fix would be to move the freeing of the CU mask into >uninit_queue. That would catch all cases where a queue gets destroyed, including >the process termination case. Yes, it' better to free queue related resource in one function. But we allocate it in kfd_ioctl_set_cu_mask and free it in pqm_set_cu_mask, uninit_queue and so on. It seems strange and error-prone. Unless we allocate it in create queue. As you said, it is temporary. It's not worth keeping it until queue is destroyed. Thanks. Regards, Lang >Regards, > Felix > > >> >> Regards, >> Lang >> >>> Regards, >>> Felix >>> >>> >>>> if (retval != 0) >>>> return retval; >>>>