You can add my RB But to be honest, the current bare-metal GPU recover approach still look not good enough especially that soft_rest checking parts: 1) not all engine/IP on all version are implemented for this, and it's very very time cost to imple them all 2) like I said before, it only shows you if the engine is busy, but busy doesn't mean hang, we should rely one time out to judge a hang, although I know even time out still cannot prove engine hang, But anyway we need a rule, so I vote for leveraging time out totally and kick soft check out We can reduce all those codes of soft resets, and call whole Asic reset instead of those soft reset as long as a job hit time out (when amd_gpu_recocovery=1) BR Monk -----Original Message----- From: Grodzovsky, Andrey Sent: Friday, December 15, 2017 9:05 PM To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org Cc: maraeo at gmail.com Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter On 12/14/2017 03:52 AM, Christian König wrote: > Am 14.12.2017 um 08:19 schrieb Liu, Monk: >>> Problem with this is that amdgpu_check_soft_reset will not be >>> called, this function which prints which IP block was hung even when >>> later we opt not to recover. >> I suggest instead to add a bool force_reset parameter to >> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can >> set it to true from amdgpu_debugfs_gpu_recover only. >> >> [ML] why you need "check_soft_reset" be called ? I think soft reset >> checking is useless totally ... because with TDR feature, the only >> thing can Tell you if GPU hang is time out warning. > > And that's exactly why we call it, we just want the warning in the logs. > > Christian. Just a ping for any more comments or a RB. Thanks, Andrey > >> >> For soft checking, it only shows you if some IP is busy or not, but >> busy may not prove the engine is hang , it may just busy .... >> >> >> BR Monk >> >> -----Original Message----- >> From: Grodzovsky, Andrey >> Sent: 2017å¹´12æ??13æ?¥ 20:53 >> To: Koenig, Christian <Christian.Koenig at amd.com>; >> amd-gfx at lists.freedesktop.org >> Cc: Liu, Monk <Monk.Liu at amd.com>; maraeo at gmail.com >> Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter >> >> >> >> On 12/13/2017 07:20 AM, Christian König wrote: >>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky: >>>> Add new parameter to control GPU recovery procedure. >>>> Retire old way of disabling GPU recovery by setting lockup_timeout >>>> == >>>> 0 and >>>> set default for lockup_timeout to 10s. >>>> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> >>>> --- >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h       | 1 + >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++ >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   | 8 ++++++-- >>>>   3 files changed, 12 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> index 3735500..26abe03 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se; >>>>   extern int amdgpu_job_hang_limit; >>>>   extern int amdgpu_lbpw; >>>>   extern int amdgpu_compute_multipipe; >>>> +extern int amdgpu_gpu_recovery; >>>>    #ifdef CONFIG_DRM_AMDGPU_SI >>>>   extern int amdgpu_si_support; >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index 8d03baa..d84b57a 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device >>>> *adev, struct amdgpu_job *job) >>>>           return 0; >>>>       } >>>>   +   if (!amdgpu_gpu_recovery) { >>>> +       DRM_INFO("GPU recovery disabled.\n"); >>>> +       return 0; >>>> +   } >>>> + >>> Please move this check into the caller of amdgpu_gpu_recover(). >>> >>> This way we can still trigger a GPU recovery manually or from the >>> hypervisor under SRIOV. >>> >>> Christian. >> Problem with this is that amdgpu_check_soft_reset will not be called, >> this function which prints which IP block was hung even when later we >> opt not to recover. >> I suggest instead to add a bool force_reset parameter to >> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can >> set it to true from amdgpu_debugfs_gpu_recover only. >> >> Thanks, >> Andrey >> >>>>       dev_info(adev->dev, "GPU reset begin!\n"); >>>>        mutex_lock(&adev->lock_reset); diff --git >>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> index 0b039bd..5c612e9 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0; >>>>   int amdgpu_hw_i2c = 0; >>>>   int amdgpu_pcie_gen2 = -1; >>>>   int amdgpu_msi = -1; >>>> -int amdgpu_lockup_timeout = 0; >>>> +int amdgpu_lockup_timeout = 10000; >>>>   int amdgpu_dpm = -1; >>>>   int amdgpu_fw_load_type = -1; >>>>   int amdgpu_aspm = -1; >>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0; >>>>   int amdgpu_job_hang_limit = 0; >>>>   int amdgpu_lbpw = -1; >>>>   int amdgpu_compute_multipipe = -1; >>>> +int amdgpu_gpu_recovery = 1; >>>>    MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in >>>> megabytes"); >>>>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ >>>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, >>>> int, 0444); >>>>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 >>>> = auto)"); >>>>   module_param_named(msi, amdgpu_msi, int, 0444); >>>>   -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms >>>> (default 0 = disable)"); >>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms >>>> +(default >>>> 10000)"); >>>>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, >>>> 0444); >>>>    MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, >>>> -1 = auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, >>>> amdgpu_lbpw, int, 0444); >>>>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be >>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)"); >>>>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, >>>> int, 0444); >>>>   +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, >>>> (1 = enable (default) , 0 = disable"); >>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); >>>> + >>>>   #ifdef CONFIG_DRM_AMDGPU_SI >>>>    #if defined(CONFIG_DRM_RADEON) || >>>> defined(CONFIG_DRM_RADEON_MODULE) >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >