Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold

On 19.10.21 at 19:50, Kent Russell wrote:
When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).
Add an override called ignore_bad_page_threshold that can be set to true
to still initialize the GPU, even when the bad page threshold has been
reached.

I would rather question the practice of this bad pages threshold.

As far as I know, the hardware works perfectly fine even when we have more bad pages than expected; we should just warn really loudly about it.

Christian.
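
To make the two positions concrete, here is a minimal sketch of the decision being discussed. It is not code from the amdgpu driver: the enum, the function name check_bad_pages() and its arguments are hypothetical, and the threshold is assumed to be an already resolved positive page count. With ignore_threshold set to false it models the behaviour described in the commit message (the GPU is not initialized once the threshold is hit); with it set to true it models both the proposed override and the "just warn loudly" alternative.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome of the bad-page check; not taken from amdgpu. */
enum init_decision {
	INIT_OK,           /* bad page count is below the threshold */
	INIT_OK_WITH_WARN, /* threshold reached, initialization proceeds anyway */
	INIT_REFUSED       /* threshold reached, GPU is left uninitialized */
};

/*
 * Decide whether to initialize the GPU.
 * Assumption: 'threshold' is a resolved, positive number of pages.
 */
static enum init_decision check_bad_pages(int bad_pages, int threshold,
					  bool ignore_threshold)
{
	if (bad_pages < threshold)
		return INIT_OK;

	/* Always warn once the threshold is reached, whatever happens next. */
	fprintf(stderr, "WARNING: %d bad pages, threshold is %d\n",
		bad_pages, threshold);

	return ignore_threshold ? INIT_OK_WITH_WARN : INIT_REFUSED;
}
```

Under these assumptions, always passing true for ignore_threshold corresponds to dropping the refusal path entirely and keeping only the warning.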


Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx>
Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
  2 files changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
  extern int amdgpu_ras_enable;
  extern uint amdgpu_ras_mask;
  extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
  extern int amdgpu_async_gfx_ring;
  extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..3e9a7b072888 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
  int amdgpu_ras_enable = -1;
  uint amdgpu_ras_mask = 0xffffffff;
  int amdgpu_bad_page_threshold = -1;
+bool amdgpu_ignore_bad_page_threshold;
  struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
  	.timeout_fatal_disable = false,
  	.period = 0x0, /* default to 0x0 (timeout disable) */
@@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
  MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
+/**
+ * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
+ * the threshold value of faulty pages detected by RAS ECC. Once the
+ * threshold is hit, the GPU will not be initialized. Use this parameter
+ * to ignore the bad page threshold so that information gathering can
+ * still be performed. This also allows for booting the GPU to clear
+ * the RAS EEPROM table.
+ */
+
+MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value))");
+module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
+
  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
  module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);



