________________________________________
From: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
Sent: Monday, 30 December 2024 09:50
To: Koenig, Christian; Deucher, Alexander; Pan, Xinhui;
airlied@xxxxxxxxx; simona@xxxxxxxx; Lazar, Lijo; Ma, Le;
hamza.mahfooz@xxxxxxx; tzimmermann@xxxxxxx; Liu, Shaoyun;
Jun.Ma2@xxxxxxx
Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
linux-kernel@xxxxxxxxxxxxxxx; tianruidong@xxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: Enable runtime modification of
gpu_recovery parameter with validation
On 2024/12/30 04:11, Christian König wrote:
On 28.12.24 at 07:32, Shuai Xue wrote:
It's observed that most GPU jobs utilize less than one server, typically
with each GPU being used by an independent job. If a job consumes
poisoned data, a SIGBUS signal is sent to terminate it. Meanwhile,
because the gpu_recovery parameter is set to -1 by default, the amdgpu
driver resets all GPUs on the server. As a result, all jobs are
terminated. Setting gpu_recovery to 0 provides an opportunity to
preemptively evacuate other jobs and subsequently reset all GPUs
manually.
*BIG* NAK to this whole approach!
Setting gpu_recovery to 0 in a production environment is *NOT*
supported at all and should never be done.
This is a pure debugging feature for JTAG debugging and can result
in random crashes and/or compromised data.
Please don't tell me that you tried to use this in a production
environment.
Regards,
Christian.
Hi, Christian,
Thank you for your quick reply.
When an application encounters an uncorrected error, it is terminated by
a SIGBUS signal, and the related bad pages are retired. I could not
figure out why gpu_recovery=0 can result in random crashes and/or
compromised data.
I tested with error injection in my dev environment:
1. load driver with gpu_recovery=0
#cat /sys/bus/pci/drivers/amdgpu/module/parameters/gpu_recovery
0
2. inject an uncorrectable ECC error to UMC
#sudo amdgpuras -d 0 -b 2 -t 8
Poison inject, logical addr:0x7f2b495f9000 physical addr:0x27f5d4b000 vmid:5
Bus error
3. GPU 0000:0a:00.0 reports error address with PA
#dmesg | grep 27f5
[424443.174154] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d43080 Row:0x1fd7 Col:0x0 Bank:0xa Channel:0x30
[424443.174156] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d4b080 Row:0x1fd7 Col:0x4 Bank:0xa Channel:0x30
[424443.174158] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d53080 Row:0x1fd7 Col:0x8 Bank:0xa Channel:0x30
[424443.174160] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d5b080 Row:0x1fd7 Col:0xc Bank:0xa Channel:0x30
[424443.174162] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f43080 Row:0x1fd7 Col:0x10 Bank:0xa Channel:0x30
[424443.174169] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f4b080 Row:0x1fd7 Col:0x14 Bank:0xa Channel:0x30
[424443.174172] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f53080 Row:0x1fd7 Col:0x18 Bank:0xa Channel:0x30
[424443.174174] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f5b080 Row:0x1fd7 Col:0x1c Bank:0xa Channel:0x30
4. All the related bad pages are AMDGPU_RAS_RETIRE_PAGE_RESERVED.
#cat /sys/devices/pci0000:05/0000:05:01.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/0000:0a:00.0/ras/gpu_vram_bad_pages | grep 27f5
0x027f5d43 : 0x00001000 : R
0x027f5d4b : 0x00001000 : R
0x027f5d53 : 0x00001000 : R
0x027f5d5b : 0x00001000 : R
0x027f5f43 : 0x00001000 : R
0x027f5f4b : 0x00001000 : R
0x027f5f53 : 0x00001000 : R
0x027f5f5b : 0x00001000 : R
AFAIK, the reserved bad pages will not be used any more. Please correct
me if I missed anything.
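The correspondence between the dmesg addresses in step 3 and the sysfs
entries in step 4 can be checked mechanically. Below is a minimal
userspace sketch in C, assuming each gpu_vram_bad_pages line has the form
"page frame number : page size : flag" and that the page frame number is
the physical address shifted right by 12 bits (which matches the 0x1000
size shown above). The struct and function names are hypothetical and are
not part of the amdgpu driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type; not taken from the amdgpu driver. */
struct bad_page {
	uint64_t pfn;   /* first column: page frame number */
	uint64_t size;  /* second column: page size in bytes */
	char flag;      /* third column: 'R' in the output above */
};

/* Assumption: 4 KiB pages, consistent with the 0x1000 size column. */
#define BAD_PAGE_SHIFT 12

/* Parse one line such as "0x027f5d4b : 0x00001000 : R". Returns 0 on success. */
static int parse_bad_page(const char *line, struct bad_page *bp)
{
	unsigned long long pfn, size;
	char flag;

	/* %llx accepts an optional 0x prefix, like strtoull() with base 16. */
	if (sscanf(line, " %llx : %llx : %c", &pfn, &size, &flag) != 3)
		return -1;

	bp->pfn = pfn;
	bp->size = size;
	bp->flag = flag;
	return 0;
}

/* Return 1 if physical address pa lies inside the retired page, else 0. */
static int bad_page_contains(const struct bad_page *bp, uint64_t pa)
{
	uint64_t base = bp->pfn << BAD_PAGE_SHIFT;

	return pa >= base && pa < base + bp->size;
}
```

For example, parsing "0x027f5d4b : 0x00001000 : R" gives a page base of
0x27f5d4b000, which covers both the injected physical address
0x27f5d4b000 and the reported error address 0x27f5d4b080.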
DRAM ECC issues are the most common problems. When one occurs, the
kernel will attempt to hard-offline the page by trying to unmap the page,
killing any owner, or triggering IO errors if needed.
ECC errors are also common for HBM, and isolating errors to each user's
job is a basic requirement in public cloud. For NVIDIA GPUs, an ECC error
can be contained to a process:
XID 94: Contained ECC error
XID 95: Uncontained ECC error
For Xid 94, these errors are contained to one application, and the
application that encountered this error must be restarted. All other
applications running at the time of the Xid are unaffected. It is
recommended to reset the GPU when convenient. Applications can continue
to be run until the reset can be performed.
For Xid 95, these errors affect multiple applications, and the affected
GPU must be reset before applications can restart.
https://docs.nvidia.com/deploy/xid-errors/
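The distinction quoted above can be summarized as a small decision
function. This is only a sketch of the quoted NVIDIA behavior, written in
C; the enum and function names are hypothetical and do not correspond to
any NVIDIA or amdgpu interface.

```c
#include <stdbool.h>

/* Xid numbers as quoted from the NVIDIA documentation above. */
enum {
	XID_CONTAINED_ECC = 94,
	XID_UNCONTAINED_ECC = 95,
};

/* Hypothetical operator actions derived from the quoted text. */
enum ecc_action {
	ECC_ACTION_NONE,            /* not one of the two Xids covered here */
	ECC_ACTION_RESTART_APP,     /* restart the affected application only;
	                               reset the GPU when convenient */
	ECC_ACTION_RESET_GPU_FIRST, /* GPU must be reset before applications
	                               can restart */
};

/* Map an Xid number to the action described in the quoted text. */
static enum ecc_action ecc_action_for_xid(int xid)
{
	switch (xid) {
	case XID_CONTAINED_ECC:
		return ECC_ACTION_RESTART_APP;
	case XID_UNCONTAINED_ECC:
		return ECC_ACTION_RESET_GPU_FIRST;
	default:
		return ECC_ACTION_NONE;
	}
}

/* True only for the contained case (Xid 94), where other applications
 * on the same GPU are described as unaffected. */
static bool other_apps_may_continue(int xid)
{
	return ecc_action_for_xid(xid) == ECC_ACTION_RESTART_APP;
}
```

The contained case is the behavior of interest here: only the job that
consumed the poisoned data is restarted, and the GPU reset is deferred
until the other jobs have been evacuated.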
Does the AMD GPU provide a similar way to meet the error isolation
requirement?
Best Regards,
Shuai
However, this parameter is read-only, necessitating correct settings at
driver load. And reloading the GPU driver in a production environment can
be challenging due to reference counts maintained by various monitoring
services.
Set the gpu_recovery parameter with read-write permission to enable
runtime modification. It enables users to dynamically manage GPU recovery
mechanisms based on real-time requirements or conditions.
Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902e1cec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
+static int amdgpu_set_gpu_recovery(const char *buf,
+ const struct kernel_param *kp)
+{
+ long val;
+ int ret;
+
+ ret = kstrtol(buf, 10, &val);
+ if (ret < 0)
+ return ret;
+
+ if (val != 1 && val != 0 && val != -1) {
+ pr_err("Invalid value for gpu_recovery: %ld, expected 0,1,-1\n",
+ val);
+ return -EINVAL;
+ }
+
+ return param_set_int(buf, kp);
+}
+
+static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
+ .set = amdgpu_set_gpu_recovery,
+ .get = param_get_int,
+};
+
/**
* DOC: gpu_recovery (int)
* Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
*/
MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
-module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
/**
* DOC: emu_mode (int)