Re: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Thu, 15 Oct 2020 08:50:53 +0200

Looks like the right approach to me as well.

Patch is Reviewed-by: Christian König <christian.koenig@xxxxxxx>.

Regards,
Christian.

On 14.10.20 at 13:44, Zhang, Hawking wrote:
[AMD Public Use]

Thanks for clarifying, Dennis. So this is a kind of race condition between a normal GPU reset and a RAS GPU reset. I'm fine with the change. The patch is

Reviewed-by: Hawking Zhang <Hawking.Zhang@xxxxxxx>

Regards,
Hawking

-----Original Message-----
From: Li, Dennis <Dennis.Li@xxxxxxx>
Sent: Wednesday, October 14, 2020 18:08
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Hi, Hawking,
       The driver has multiple paths into GPU reset, so it cannot guarantee that the bad page record update has finished before a GPU reset starts.

Best Regards
Dennis Li
-----Original Message-----
From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Sent: Wednesday, October 14, 2020 5:52 PM
To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Li, Dennis <Dennis.Li@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Hmm, I think the bad page record update is done before the GPU reset work is scheduled. For the mGPU case, should we walk through all the nodes in a hive before issuing the GPU reset work?

Regards,
Hawking

-----Original Message-----
From: Dennis Li <Dennis.Li@xxxxxxx>
Sent: Wednesday, October 14, 2020 17:41
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Li, Dennis <Dennis.Li@xxxxxxx>
Subject: [PATCH] drm/amdgpu: protect eeprom update from GPU reset

Because i2c is unstable during GPU reset, the driver needs to protect the EEPROM update from GPU reset so that no bad page record is missed.

Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 0e64c39a2372..695bcfc5c983 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -149,7 +149,11 @@ static int __update_table_header(struct amdgpu_ras_eeprom_control *control,
  
  	msg.addr = control->i2c_address;
  
+	/* i2c may be unstable in gpu reset */
+	down_read(&adev->reset_sem);
  	ret = i2c_transfer(&adev->pm.smu_i2c, &msg, 1);
+	up_read(&adev->reset_sem);
+
  	if (ret < 1)
  		DRM_ERROR("Failed to write EEPROM table header, ret:%d", ret);
  
@@ -557,7 +561,11 @@ int amdgpu_ras_eeprom_process_recods(struct amdgpu_ras_eeprom_control *control,
  		control->next_addr += EEPROM_TABLE_RECORD_SIZE;
  	}
  
+	/* i2c may be unstable in gpu reset */
+	down_read(&adev->reset_sem);
  	ret = i2c_transfer(&adev->pm.smu_i2c, msgs, num);
+	up_read(&adev->reset_sem);
+
  	if (ret < 1) {
  		DRM_ERROR("Failed to process EEPROM table records, ret:%d", ret);
  
--
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
