[AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > Sent: Friday, October 11, 2024 11:17 AM > To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>; Li, Yunxiang (Teddy) > <Yunxiang.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian > <Christian.Koenig@xxxxxxx>; Gui, Jack <Jack.Gui@xxxxxxx> > Subject: Re: [PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating pdd > > > On 2024-10-11 11:07, Joshi, Mukul wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > >> -----Original Message----- > >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > >> Yunxiang Li > >> Sent: Thursday, October 10, 2024 12:18 PM > >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian > >> <Christian.Koenig@xxxxxxx>; Gui, Jack <Jack.Gui@xxxxxxx>; Li, > >> Yunxiang > >> (Teddy) <Yunxiang.Li@xxxxxxx> > >> Subject: [PATCH v2 2/2] drm/amdkfd: pause autosuspend when creating > >> pdd > >> > >> When using MES creating a pdd will require talking to the GPU to > >> setup the relevant context. The code here forgot to wake up the GPU > >> in case it was in suspend, this causes KVM to EFAULT for passthrough GPU > for example. > >> > >> Also, change the other place where we pause suspend to use the > >> cleaner pm_runtime_resume_and_get helper. > >> > >> Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3") > >> Signed-off-by: Yunxiang Li <Yunxiang.Li@xxxxxxx> > >> --- > >> > >> It is unclear to me if kfd_process_destroy_pdds also have this > >> problem, or is freeing gtt mem guaranteed to succeed even with the GPU in > suspend. > >> > >> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 13 +++++++++---- > >> 1 file changed, 9 insertions(+), 4 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >> index c9ce95bc3ec8e..c3d510f1dabfa 100644 > >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >> @@ -1633,12 +1633,19 @@ struct kfd_process_device > >> *kfd_create_process_device_data(struct kfd_node *dev, > >> atomic64_set(&pdd->evict_duration_counter, 0); > >> > >> if (dev->kfd->shared_resources.enable_mes) { > >> + retval = pm_runtime_resume_and_get(bdev); > >> + if (retval < 0) { > >> + pr_err("failed to stop autosuspend\n"); > >> + goto err_free_pdd; > >> + } > > > > I am not 100% sure if it makes sense to resume here and then put it > > back on suspend queue after the allocation. And then again call > pm_runtime_resume in kfd_bind_process_to_device(). > > Maybe you can set pdd->runtime_inuse = true here itself and not call the > pm_runtime_put_autosuspend here. > > > > I will probably let @Kuehling, Felix comment on this. > > I don't think we want to set pdd->runtime_inuse here. That would make all GPUs > like they're always in use, no matter if the application is actually using them. I think > this part of the change is OK. > Ok thanks. I got confused and thought we would call kfd_bind_to_process() everytime. But that’s not the case. > > > > Regards, > > Mukul > > > >> retval = amdgpu_amdkfd_alloc_gtt_mem(adev, > >> AMDGPU_MES_PROC_CTX_SIZE, > >> &pdd->proc_ctx_bo, > >> &pdd->proc_ctx_gpu_addr, > >> &pdd->proc_ctx_cpu_ptr, > >> false); > >> + pm_runtime_mark_last_busy(bdev); > >> + pm_runtime_put_autosuspend(bdev); > >> if (retval) { > >> dev_err(bdev, > >> "failed to allocate process context > >> bo\n"); @@ - > >> 1768,11 +1775,9 @@ struct kfd_process_device > >> *kfd_bind_process_to_device(struct kfd_node *dev, > >> * pdd is destroyed. > >> */ > >> if (!pdd->runtime_inuse) { > >> - err = pm_runtime_get_sync(bdev); > >> - if (err < 0) { > >> - pm_runtime_put_autosuspend(adev_to_drm(dev->adev)- > >>> dev); > >> + err = pm_runtime_resume_and_get(bdev); > > I got confused here, looking at the number of interfaces available in the pm_runtime > header file. There seem to be many ways to do the same thing. There may be > subtle difference between them. It's not obvious to me that this change is correct or > harmless. > I think it is OK. The function description above pm_runtime_get_sync() says to consider using pm_runtime_resume_and_get() instead. Regards, Mukul > Regards, > Felix > > > >> + if (err < 0) > >> return ERR_PTR(err); > >> - } > >> } > >> > >> /* > >> -- > >> 2.34.1 > >