Re: [PATCH 1/2] drm/amdgpu/ras: Extract amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages

Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> · Mon, 18 Nov 2019 09:05:26 -0500

Christian - ping.

Andrey

On 11/14/19 11:41 AM, Andrey Grodzovsky wrote:

On 11/13/19 10:09 PM, Zhou1, Tao wrote:
Two questions:

1. "we lose all reservation during ASIC reset"
Are you sure of it? I remember the content of vram may be lost after 
reset but the allocated status could be reserved.

Yea, now that I am thinking of it I think i might have confused it 
with BO content recovery in amdgpu_device_recover_vram for shadow 
buffers which are page tables only but just for VRAM reservation 
status this might be irrelevant... Christian - can you confirm Tao is 
correct on this ?


2. You change the bad page handle flow from:

detect bad page -> reserve vram for bad page -> save bad page info to 
eeprom -> gpu reset

to:

detect bad page (this time) -> save bad page (last time) info to 
eeprom -> gpu reset -> reserve vram for bad page (this time)

Even though if I am wrong on the first point this is irrelevant but 
still - Why saving bad page is from last time ? See 
https://cgit.freedesktop.org/~agd5f/linux/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c?h=amd-staging-drm-next#n1436 
- the save count is the latest so as the content of 
data->bps[control->num_recs] up to data->bps[control->num_recs + 
save_count] as those are updated in 
amdgpu_umc_process_ras_data_cb->amdgpu_ras_add_bad_pages which is 
called right BEFORE amdgpu_ras_reset_gpu->amdgpu_ras_save_bad_pages in 
the interrupt sequence

Andrey



Is that right?  Saving bad page (this time) info to eeprom is delayed 
to the next time that bad page is detected? But the time of detecting 
bad page is random.
I think the bad page info would be lost without saving to eeprom if 
power off occurs.

detect bad page (this time) -> save bad page (last time) info to 
eeprom -> gpu reset -> reserve vram for bad page (this time) -> 
poweroff/system reset (and bad page info (this time) is lost)

Tao

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
Andrey Grodzovsky
Sent: 2019年11月14日 6:45
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: alexdeucher@xxxxxxxxx; Grodzovsky, Andrey
<Andrey.Grodzovsky@xxxxxxx>; Chen, Guchun <Guchun.Chen@xxxxxxx>;
Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Subject: [PATCH 1/2] drm/amdgpu/ras: Extract
amdgpu_ras_save_bad_pages from amdgpu_ras_reserve_bad_pages

We want to be be able to call them independently from one another so 
that
before GPU reset just amdgpu_ras_save_bad_pages is called.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++++----------
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  4 +++-
  2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 4044834..600a86d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1421,7 +1421,7 @@ int amdgpu_ras_add_bad_pages(struct
amdgpu_device *adev,
   * write error record array to eeprom, the function should be
   * protected by recovery_lock
   */
-static int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
+int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
  {
      struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
      struct ras_err_handler_data *data;
@@ -1520,7 +1520,7 @@ int amdgpu_ras_reserve_bad_pages(struct
amdgpu_device *adev)
      struct ras_err_handler_data *data;
      uint64_t bp;
      struct amdgpu_bo *bo = NULL;
-    int i, ret = 0;
+    int i;

      if (!con || !con->eh_data)
          return 0;
@@ -1548,12 +1548,9 @@ int amdgpu_ras_reserve_bad_pages(struct
amdgpu_device *adev)
          data->last_reserved = i + 1;
          bo = NULL;
      }
-
-    /* continue to save bad pages to eeprom even reesrve_vram fails */
-    ret = amdgpu_ras_save_bad_pages(adev);
  out:
      mutex_unlock(&con->recovery_lock);
-    return ret;
+    return 0;
  }

  /* called when driver unload */
@@ -1615,14 +1612,11 @@ int amdgpu_ras_recovery_init(struct
amdgpu_device *adev)
          ret = amdgpu_ras_load_bad_pages(adev);
          if (ret)
              goto free;
-        ret = amdgpu_ras_reserve_bad_pages(adev);
-        if (ret)
-            goto release;
+        amdgpu_ras_reserve_bad_pages(adev);
      }

      return 0;

-release:
      amdgpu_ras_release_bad_pages(adev);
  free:
      kfree((*data)->bps);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index f80fd34..12b0797 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -492,6 +492,8 @@ unsigned long
amdgpu_ras_query_error_count(struct amdgpu_device *adev,  int
amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
          struct eeprom_table_record *bps, int pages);

+int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
+
  int amdgpu_ras_reserve_bad_pages(struct amdgpu_device *adev);

  static inline int amdgpu_ras_reset_gpu(struct amdgpu_device *adev, 
@@ -
503,7 +505,7 @@ static inline int amdgpu_ras_reset_gpu(struct
amdgpu_device *adev,
       * i2c may be unstable in gpu reset
       */
      if (in_task())
-        amdgpu_ras_reserve_bad_pages(adev);
+        amdgpu_ras_save_bad_pages(adev);

      if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
          schedule_work(&ras->recovery_work);
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx