Yes, understood . I already rebased the changes on amd-kfd-staging and add you and Christian as reviewer. Regards Shaoyun.liu -----Original Message----- From: Kuehling, Felix Sent: Friday, January 26, 2018 2:47 PM To: Liu, Shaoyun; Koenig, Christian; amd-gfx at lists.freedesktop.org Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset Hi Shaoyun, As I'm upstreaming GPUVM support for KFD right now, a bunch of amdgpu_amdkfd_... changes are going to go through Oded's tree. To avoid conflicts when merging with Alex's tree, I'd recommend we don't submit your changes to amd-staging-drm-next. Instead submit them to amd-kfd-staging, and I can take care of upstreaming them via Oded later. We can still do code review on the amd-gfx mailing list to get valuable feedback. Regards,  Felix On 2018-01-26 01:53 PM, Liu, Shaoyun wrote: > It could be done in next step . > I just notice my change is based on amd-kfd-staging , I will add you as the reviewer again . > > Regards > Shaoyun.liu > > -----Original Message----- > From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] > Sent: Friday, January 26, 2018 1:42 PM > To: Liu, Shaoyun; Koenig, Christian; amd-gfx at lists.freedesktop.org > Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset > > Alternatively if you really want to add some distinction here you could add an enum like trigger source. > > Something like manual, SRIOV host, GPU scheduler, KFD, interrupt etc... > And then use the user configurable option as bitmask to enable/disable GPU recovery for each trigger source. > > Regards, > Christian. > > Am 26.01.2018 um 19:37 schrieb Liu, Shaoyun: >> Ok, the wrong hang detection when amdgpu_gpu_recovery is enabled may be another issue , we can fix it later . >> >> Changed to 'false' flag as suggested and submitted . >> >> Regards >> Shaoyun.liu >> >> >> -----Original Message----- >> From: Koenig, Christian >> Sent: Friday, January 26, 2018 1:28 PM >> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org >> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset >> >>> Sorry , I meant if I use the "false" flag and gpu_recovery is not enabled , the reset will be ignored. >> And exactly that is the intention here. So please use the false flag, apart from that the patch looks good to me. >> >> Regards, >> Christian. >> >> Am 26.01.2018 um 18:56 schrieb Liu, Shaoyun: >>> Sorry , I meant if I use the "false" flag and gpu_recovery is not enabled , the reset will be ignored. >>> >>> -----Original Message----- >>> From: Liu, Shaoyun >>> Sent: Friday, January 26, 2018 12:54 PM >>> To: Koenig, Christian; amd-gfx at lists.freedesktop.org >>> Subject: RE: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset >>> >>> If we did use force flag and amdgpu_gpu_recovery = 0 , the reset will be ignored . I'm kind of like this reset can go through like sriov . If we depends on the parameter amdgpu_gpu_recovery , it may think the GPU is hang and trigger the GPU reset when rocm submit some heavy compute stuff running and actually not hang . >>> >>> Regards >>> Shaoyun.liu >>> >>> -----Original Message----- >>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com] >>> Sent: Friday, January 26, 2018 12:41 PM >>> To: Liu, Shaoyun; amd-gfx at lists.freedesktop.org >>> Subject: Re: [PATCH 3/3] drm/amdgpu: reset kfd during amdgpu reset >>> >>> Am 26.01.2018 um 18:38 schrieb Shaoyun Liu: >>>> Change-Id: I222f4bb2c9a91c7a4764e6aa706e7d7f2e6d948d >>>> Signed-off-by: Shaoyun Liu <Shaoyun.Liu at amd.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 19 +++++++++++++++++++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 6 ++++++ >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++ >>>> 3 files changed, 30 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c >>>> index 2d99099..cb1ee26 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c >>>> @@ -239,6 +239,25 @@ int amdgpu_amdkfd_resume(struct amdgpu_device *adev) >>>> return r; >>>> } >>>> >>>> +void amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev) { >>>> + if (adev->kfd) >>>> + kgd2kfd->pre_reset(adev->kfd); >>>> +} >>>> + >>>> +void amdgpu_amdkfd_post_reset(struct amdgpu_device *adev) { >>>> + if (adev->kfd) >>>> + kgd2kfd->post_reset(adev->kfd); >>>> +} >>>> + >>>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd) { >>>> + struct amdgpu_device *adev = (struct amdgpu_device *)kgd; >>>> + >>>> + amdgpu_device_gpu_recover(adev, NULL, true); >>> Use false for the force parameter here, apart from that the set looks good to me. >>> >>> Regards, >>> Christian. >>> >>>> +} >>>> + >>>> int amdgpu_amdkfd_submit_ib(struct kgd_dev *kgd, enum kgd_engine_type engine, >>>> uint32_t vmid, uint64_t gpu_addr, >>>> uint32_t *ib_cmd, uint32_t ib_len) diff --git >>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >>>> index 7c36e52..230761f 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >>>> @@ -155,6 +155,12 @@ int amdgpu_amdkfd_copy_mem_to_mem(struct kgd_dev *kgd, struct kgd_mem *src_mem, >>>> bool amdgpu_amdkfd_is_kfd_vmid(struct amdgpu_device *adev, >>>> u32 vmid); >>>> >>>> +int amdgpu_amdkfd_pre_reset(struct amdgpu_device *adev); >>>> + >>>> +int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev); >>>> + >>>> +void amdgpu_amdkfd_gpu_reset(struct kgd_dev *kgd); >>>> + >>>> /* Shared API */ >>>> int map_bo(struct amdgpu_device *rdev, uint64_t va, void *vm, >>>> struct amdgpu_bo *bo, struct amdgpu_bo_va **bo_va); diff >>>> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index 94f837b..61e7d35 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -2660,6 +2660,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >>>> atomic_inc(&adev->gpu_reset_counter); >>>> adev->in_gpu_reset = 1; >>>> >>>> + /* Block kfd */ >>>> + amdgpu_amdkfd_pre_reset(adev); >>>> + >>>> /* block TTM */ >>>> resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev); >>>> /* store modesetting */ >>>> @@ -2765,6 +2768,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >>>> amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r); >>>> } else { >>>> dev_info(adev->dev, "GPU reset(%d) >>>> successed!\n",atomic_read(&adev->gpu_reset_counter)); >>>> + /*unlock kfd after a successfully recovery*/ >>>> + amdgpu_amdkfd_post_reset(adev); >>>> } >>>> >>>> amdgpu_vf_error_trans_all(adev); >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx