[AMD Public Use]

Thanks for clarifying, Dennis. So this is kind of a race condition between the normal GPU reset and the RAS GPU reset. I'm fine with the change.

The patch is
Reviewed-by: Hawking Zhang <Hawking.Zhang@xxxxxxx>

Regards,
Hawking

-----Original Message-----
From: Li, Dennis <Dennis.Li@xxxxxxx>
Sent: Wednesday, October 14, 2020 18:08
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Hi, Hawking,

The driver has multiple paths into GPU reset, so it cannot guarantee that the bad record update has been done before a GPU reset.

Best Regards
Dennis Li

-----Original Message-----
From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Sent: Wednesday, October 14, 2020 5:52 PM
To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Li, Dennis <Dennis.Li@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Hmm, I think the bad page record update is done ahead of scheduling the GPU reset work. For the mGPU case, shall we walk through all the nodes in a hive before issuing the GPU reset work?

Regards,
Hawking

-----Original Message-----
From: Dennis Li <Dennis.Li@xxxxxxx>
Sent: Wednesday, October 14, 2020 17:41
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Li, Dennis <Dennis.Li@xxxxxxx>
Subject: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Because i2c is unstable during GPU reset, the driver needs to protect the EEPROM update from GPU reset so that no bad page record is missed.

Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 0e64c39a2372..695bcfc5c983 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -149,7 +149,11 @@ static int __update_table_header(struct amdgpu_ras_eeprom_control *control,
 
 	msg.addr = control->i2c_address;
 
+	/* i2c may be unstable in gpu reset */
+	down_read(&adev->reset_sem);
 	ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
+	up_read(&adev->reset_sem);
+
 	if (ret < 1)
 		DRM_ERROR("Failed to write EEPROM table header, ret:%d", ret);
 
@@ -557,7 +561,11 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
 		control->next_addr += EEPROM_TABLE_RECORD_SIZE;
 	}
 
+	/* i2c may be unstable in gpu reset */
+	down_read(&adev->reset_sem);
 	ret = i2c_transfer(&adev->pm.smu_i2c, msgs, num);
+	up_read(&adev->reset_sem);
+
 	if (ret < 1) {
 		DRM_ERROR("Failed to process EEPROM table records, ret:%d", ret);
 
-- 
2.17.1