Re: [PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get lost in gpu reset

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Mon, 30 Sep 2019 11:24:52 +0200

Am 30.09.19 um 11:06 schrieb Zhou1, Tao:
Hi Christian:

So the allocation information of a bo's vram page could be reserved after vram lost?

Yes, exactly. VRAM lost just means that we are returning an error code 
to userspace when they try to use a context created before VRAM was lost.

In fact, this series could be dropped if the result of amdgpu_bo_create_reserved is guaranteed to be failed after vram lost (protect bad pages from reallocating).

The BOs are not freed in any way, so I think Andrey meant something else.

Regards,
Christian.


Hi Andrey:

A bad page's allocation information will not be lost in gpu reset according to Christian's comments. Do you have any other concern?

Regards,
Tao

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: 2019年9月30日 16:35
To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun
<Guchun.Chen@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Subject: Re: [PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get
lost in gpu reset

Am 30.09.19 um 05:19 schrieb Zhou1, Tao:
the info of retired page's bo may be lost if vram lost is encountered
in gpu reset (gpu page table in vram may be lost), force to recreate
all bos

v2: simplify NULL pointer check
      add more comments
Repeating on the v2 of the patch set, this is complete nonsense.

When VRAM is lost the BOs and their reservation still exits, only the content
is lost because the MC is has been resetted.

Regards,
Christian.

Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 48
++++++++++++++++++++++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
   3 files changed, 50 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 62955156653c..a89f6d053b3f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3675,6 +3675,7 @@ static int amdgpu_do_asic_reset(struct
amdgpu_hive_info *hive,
   				if (vram_lost) {
   					DRM_INFO("VRAM is lost due to GPU
reset!\n");
   					amdgpu_inc_vram_lost(tmp_adev);
+
	amdgpu_ras_recovery_reset(tmp_adev);
   				}

   				r = amdgpu_gtt_mgr_recover(
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 486568ded6d6..3f2a2f13e4c6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1573,6 +1573,54 @@ static int amdgpu_ras_recovery_fini(struct
amdgpu_device *adev)

   	return 0;
   }
+
+/*
+ * the info of retired page's bo may be lost if vram lost is
+encountered
+ * in gpu reset (gpu page table in vram may be lost), force to
+recreate
+ * all bos
+ */
+void amdgpu_ras_recovery_reset(struct amdgpu_device *adev) {
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+	struct ras_err_handler_data *data;
+	uint64_t bp;
+	struct amdgpu_bo *bo = NULL;
+	int i;
+
+	if (!con || !con->eh_data)
+		return;
+
+	/* Note: It's called in gpu reset, recovery_lock may be acquired
before
+	 * gpu reset. The following code is out of protect of con-
recovery_lock
+	 * in case of deadlock and bps_bo should be recreated (if successfully)
+	 * whether recovery_lock is locked or not.
+	 * We assume there is no other ras recovery operation simultaneous
during
+	 * gpu reset.
+	 */
+	data = con->eh_data;
+	/* force to reserve all retired page again */
+	data->last_reserved = 0;
+
+	for (i = data->last_reserved; i < data->count; i++) {
+		bp = data->bps[i].retired_page;
+
+		/* There are three cases of reserve error should be ignored:
+		 * 1) a ras bad page has been allocated (used by someone);
+		 * 2) a ras bad page has been reserved (duplicate error
injection
+		 *    for one page);
+		 * 3) bo info isn't lost in gpu reset
+		 */
+		if (amdgpu_bo_create_kernel_at(adev, bp <<
AMDGPU_GPU_PAGE_SHIFT,
+					       AMDGPU_GPU_PAGE_SIZE,
+					       AMDGPU_GEM_DOMAIN_VRAM,
+					       &bo, NULL))
+			DRM_NOTE("RAS NOTE: recreate bo for retired page
0x%llx fail\n", bp);
+		else
+			data->bps_bo[i] = bo;
+		data->last_reserved = i + 1;
+		bo = NULL;
+	}
+}
   /* recovery end */

   /* return 0 if ras will reset gpu and repost.*/ diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index f80fd3428c98..7a606d3be806 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -479,6 +479,7 @@ static inline int amdgpu_ras_is_supported(struct
amdgpu_device *adev,
   }

   int amdgpu_ras_recovery_init(struct amdgpu_device *adev);
+void amdgpu_ras_recovery_reset(struct amdgpu_device *adev);
   int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
   		unsigned int block);

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx