On 12/14/2017 03:52 AM, Christian König wrote: > Am 14.12.2017 um 08:19 schrieb Liu, Monk: >>> Problem with this is that amdgpu_check_soft_reset will not be >>> called, this function which prints which IP block was hung even when >>> later we opt not to recover. >> I suggest instead to add a bool force_reset parameter to >> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can >> set it to true from amdgpu_debugfs_gpu_recover only. >> >> [ML] why you need "check_soft_reset" be called ? I think soft reset >> checking is useless totally ... because with TDR feature, the only >> thing can >> Tell you if GPU hang is time out warning. > > And that's exactly why we call it, we just want the warning in the logs. > > Christian. Just a ping for any more comments or a RB. Thanks, Andrey > >> >> For soft checking, it only shows you if some IP is busy or not, but >> busy may not prove the engine is hang , it may just busy .... >> >> >> BR Monk >> >> -----Original Message----- >> From: Grodzovsky, Andrey >> Sent: 2017å¹´12æ??13æ?¥ 20:53 >> To: Koenig, Christian <Christian.Koenig at amd.com>; >> amd-gfx at lists.freedesktop.org >> Cc: Liu, Monk <Monk.Liu at amd.com>; maraeo at gmail.com >> Subject: Re: [PATCH] drm/amdgpu: Add gpu_recovery parameter >> >> >> >> On 12/13/2017 07:20 AM, Christian König wrote: >>> Am 12.12.2017 um 20:16 schrieb Andrey Grodzovsky: >>>> Add new parameter to control GPU recovery procedure. >>>> Retire old way of disabling GPU recovery by setting lockup_timeout == >>>> 0 and >>>> set default for lockup_timeout to 10s. >>>> >>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com> >>>> --- >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h       | 1 + >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++ >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   | 8 ++++++-- >>>>   3 files changed, 12 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> index 3735500..26abe03 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >>>> @@ -126,6 +126,7 @@ extern int amdgpu_param_buf_per_se; >>>>   extern int amdgpu_job_hang_limit; >>>>   extern int amdgpu_lbpw; >>>>   extern int amdgpu_compute_multipipe; >>>> +extern int amdgpu_gpu_recovery; >>>>    #ifdef CONFIG_DRM_AMDGPU_SI >>>>   extern int amdgpu_si_support; >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index 8d03baa..d84b57a 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -3030,6 +3030,11 @@ int amdgpu_gpu_recover(struct amdgpu_device >>>> *adev, struct amdgpu_job *job) >>>>           return 0; >>>>       } >>>>   +   if (!amdgpu_gpu_recovery) { >>>> +       DRM_INFO("GPU recovery disabled.\n"); >>>> +       return 0; >>>> +   } >>>> + >>> Please move this check into the caller of amdgpu_gpu_recover(). >>> >>> This way we can still trigger a GPU recovery manually or from the >>> hypervisor under SRIOV. >>> >>> Christian. >> Problem with this is that amdgpu_check_soft_reset will not be called, >> this function which prints which IP block was hung even when later we >> opt not to recover. >> I suggest instead to add a bool force_reset parameter to >> amdgpu_gpu_recover which will override amdgpu_gpu_recovery and we can >> set it to true from amdgpu_debugfs_gpu_recover only. >> >> Thanks, >> Andrey >> >>>>       dev_info(adev->dev, "GPU reset begin!\n"); >>>>        mutex_lock(&adev->lock_reset); diff --git >>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> index 0b039bd..5c612e9 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>>> @@ -90,7 +90,7 @@ int amdgpu_disp_priority = 0; >>>>   int amdgpu_hw_i2c = 0; >>>>   int amdgpu_pcie_gen2 = -1; >>>>   int amdgpu_msi = -1; >>>> -int amdgpu_lockup_timeout = 0; >>>> +int amdgpu_lockup_timeout = 10000; >>>>   int amdgpu_dpm = -1; >>>>   int amdgpu_fw_load_type = -1; >>>>   int amdgpu_aspm = -1; >>>> @@ -128,6 +128,7 @@ int amdgpu_param_buf_per_se = 0; >>>>   int amdgpu_job_hang_limit = 0; >>>>   int amdgpu_lbpw = -1; >>>>   int amdgpu_compute_multipipe = -1; >>>> +int amdgpu_gpu_recovery = 1; >>>>    MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in >>>> megabytes"); >>>>   module_param_named(vramlimit, amdgpu_vram_limit, int, 0600); @@ >>>> -165,7 +166,7 @@ module_param_named(pcie_gen2, amdgpu_pcie_gen2, int, >>>> 0444); >>>>   MODULE_PARM_DESC(msi, "MSI support (1 = enable, 0 = disable, -1 = >>>> auto)"); >>>>   module_param_named(msi, amdgpu_msi, int, 0444); >>>>   -MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms >>>> (default 0 = disable)"); >>>> +MODULE_PARM_DESC(lockup_timeout, "GPU lockup timeout in ms (default >>>> 10000)"); >>>>   module_param_named(lockup_timeout, amdgpu_lockup_timeout, int, >>>> 0444); >>>>    MODULE_PARM_DESC(dpm, "DPM support (1 = enable, 0 = disable, -1 = >>>> auto)"); @@ -280,6 +281,9 @@ module_param_named(lbpw, amdgpu_lbpw, >>>> int, 0444); >>>>   MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be >>>> spread across pipes (1 = enable, 0 = disable, -1 = auto)"); >>>>   module_param_named(compute_multipipe, amdgpu_compute_multipipe, >>>> int, 0444); >>>>   +MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 >>>> = enable (default) , 0 = disable"); >>>> +module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); >>>> + >>>>   #ifdef CONFIG_DRM_AMDGPU_SI >>>>    #if defined(CONFIG_DRM_RADEON) || >>>> defined(CONFIG_DRM_RADEON_MODULE) >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >