On Mon, Apr 29, 2024 at 4:07 AM Kenneth Feng <kenneth.feng@xxxxxxx> wrote:
>
> use the default reset for ras recovery
>
> Signed-off-by: Kenneth Feng <kenneth.feng@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a037e8fba29f..f92b2c4f0d5c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2437,6 +2437,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>         struct amdgpu_device *adev = ras->adev;
>         struct list_head device_list, *device_list_handle = NULL;
>         struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
> +       int save_reset_method = amdgpu_reset_method;
>
>         if (hive) {
>                 atomic_set(&hive->ras_recovery, 1);
> @@ -2501,7 +2502,13 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>                 }
>         }
>
> +       if (amdgpu_gpu_recovery == 2)
> +               amdgpu_reset_method = -1;
> +
>         amdgpu_device_gpu_recover(ras->adev, NULL, &reset_context);
> +
> +       if (amdgpu_gpu_recovery == 2)
> +               amdgpu_reset_method = save_reset_method;

This is racy.  amdgpu_reset_method is a global variable and is
referenced by every AMD GPU in the system that is using amdgpu, so
overwriting it here, even temporarily, can change the reset method
that another device picks up while this recovery is in flight.

To handle this properly, we should store the selected reset method in
the adev structure and set it based on the module parameter at driver
bind time.  Then, if we need to change the reset method at runtime, we
can change the device-specific one in adev.

Maybe it would be better to have two variables in adev, e.g.,
default_reset_method and override_reset_method.  In cases where we
have to use the default method, we can use that.  In other cases, we
can use the override method.

Alex

>         }
>         atomic_set(&ras->in_recovery, 0);
>         if (hive) {
> --
> 2.34.1
>
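
For illustration, the per-device idea from the reply could be sketched
roughly as below.  This is a standalone userspace sketch, not driver
code: struct demo_device, demo_device_init_reset() and
demo_pick_reset_method() are hypothetical names, and only the two field
names (default_reset_method, override_reset_method) come from the
suggestion itself.  It assumes -1 means "let the driver choose", as in
the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: -1 selects the driver's default reset method. */
#define DEMO_RESET_METHOD_AUTO (-1)

/* Hypothetical per-device state; NOT the real struct amdgpu_device. */
struct demo_device {
	int default_reset_method;  /* always the driver's default choice */
	int override_reset_method; /* snapshot of the module parameter */
};

/* Bind time: copy the global parameter into the device exactly once. */
void demo_device_init_reset(struct demo_device *dev, int module_param)
{
	assert(dev != NULL);
	dev->default_reset_method = DEMO_RESET_METHOD_AUTO;
	dev->override_reset_method = module_param;
}

/*
 * Recovery time: choose a method from per-device state only, so no
 * global is written and other devices are unaffected.
 */
int demo_pick_reset_method(const struct demo_device *dev, bool force_default)
{
	assert(dev != NULL);
	return force_default ? dev->default_reset_method
			     : dev->override_reset_method;
}
```

With something along these lines, the RAS recovery path would ask for
the default method for its own device (force_default = true) instead
of saving and restoring a global around amdgpu_device_gpu_recover().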