On Fri, Jan 03, 2025 at 08:21:43AM +0000, Koenig, Christian wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > Hi Shuai, > > setting gpu_recovery=0 is not even remotely related to RAS. If that > option affects RAS behavior in any way then that is a bug. > > The purpose of setting gpu_recovery=0 is to disable resets after a > submission timeout most likely caused by an unrecoverable HW error. > > This is necessary for JTAG debugging in our labs during HW bringup and > should *NEVER* be used on any production system. > > We already discussed with upstream maintainers that we should probably > mark the kernel as tainted to indicate that it might be in an unreliable > HW state. I will push for this now since there seems to be a big > misunderstanding what this option does. module_param_unsafe and friends really should be the default for module options really, since generally they're just for debugging and other hacks. With multiple gpus you can't control options per-device with module options in a reasonable way, so that's all no-go. So might want to go large-scale relabelling module options while you're at it. -Sima > > Regards, > Christian. > > ________________________________________ > Von: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> > Gesendet: Montag, 30. Dezember 2024 09:50 > An: Koenig, Christian; Deucher, Alexander; Pan, Xinhui; airlied@xxxxxxxxx; simona@xxxxxxxx; Lazar, Lijo; Ma, Le; hamza.mahfooz@xxxxxxx; tzimmermann@xxxxxxx; Liu, Shaoyun; Jun.Ma2@xxxxxxx > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; tianruidong@xxxxxxxxxxxxxxxxx > Betreff: Re: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation > > > > 在 2024/12/30 04:11, Christian König 写道: > > Am 28.12.24 um 07:32 schrieb Shuai Xue: > >> It's observed that most GPU jobs utilize less than one server, typically > >> with each GPU being used by an independent job. If a job consumed poisoned > >> data, a SIGBUS signal will be sent to terminate it. Meanwhile, the > >> gpu_recovery parameter is set to -1 by default, the amdgpu driver resets > >> all GPUs on the server. As a result, all jobs are terminated. Setting > >> gpu_recovery to 0 provides an opportunity to preemptively evacuate other > >> jobs and subsequently manually reset all GPUs. > > > > *BIG* NAK to this whole approach! > > > > Setting gpu_recovery to 0 in a production environment is *NOT* supported at all and should never be done. > > > > This is a pure debugging feature for JTAG debugging and can result in random crashes and/or compromised data. > > > > Please don't tell me that you tried to use this in a production environment. > > > > Regards, > > Christian. > > Hi, Christian, > > Thank you for your quick reply. > > When an application encounters uncorrected error, it will be terminate by a > SIGBUS signal. The related bad pages are retired. I did not figure why > gpu_recovery=0 can result in random crashes and/or compromised data. > > I test with error injection in my dev enviroment: > > 1. load driver with gpu_recovery=0 > #cat /sys/bus/pci/drivers/amdgpu/module/parameters/gpu_recovery > 0 > > 2. inject a Uncorrectable ECC error to UMC > #sudo amdgpuras -d 0 -b 2 -t 8 > Poison inject, logical addr:0x7f2b495f9000 physical addr:0x27f5d4b000 vmid:5 > Bus error > > 3. GPU 0000:0a:00.0 reports error address with PA > #dmesg | grep 27f5 > [424443.174154] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d43080 Row:0x1fd7 Col:0x0 Bank:0xa Channel:0x30 > [424443.174156] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d4b080 Row:0x1fd7 Col:0x4 Bank:0xa Channel:0x30 > [424443.174158] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d53080 Row:0x1fd7 Col:0x8 Bank:0xa Channel:0x30 > [424443.174160] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d5b080 Row:0x1fd7 Col:0xc Bank:0xa Channel:0x30 > [424443.174162] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f43080 Row:0x1fd7 Col:0x10 Bank:0xa Channel:0x30 > [424443.174169] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f4b080 Row:0x1fd7 Col:0x14 Bank:0xa Channel:0x30 > [424443.174172] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f53080 Row:0x1fd7 Col:0x18 Bank:0xa Channel:0x30 > [424443.174174] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f5b080 Row:0x1fd7 Col:0x1c Bank:0xa Channel:0x30 > > 4. All the related bad pages are AMDGPU_RAS_RETIRE_PAGE_RESERVED. > #cat /sys/devices/pci0000:05/0000:05:01.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/0000:0a:00.0/ras/gpu_vram_bad_pages | grep 27f5 > 0x027f5d43 : 0x00001000 : R > 0x027f5d4b : 0x00001000 : R > 0x027f5d53 : 0x00001000 : R > 0x027f5d5b : 0x00001000 : R > 0x027f5f43 : 0x00001000 : R > 0x027f5f4b : 0x00001000 : R > 0x027f5f53 : 0x00001000 : R > 0x027f5f5b : 0x00001000 : R > > AFAIK, the reserved bad pages will not be used any more. Please correct me if > I missed anything. > > DRAM ECC issues are the most common problems. When it occurs, the kernel will > attempt to hard-offline the page, by trying to unmap the page or killing any > owner, or triggering IO errors if needed. > > ECC error is also common for HBM and error isolation from each user's job is a > basic requirement in public cloud. For NVIDIA GPU, a ECC error could be > contained to a process. > > > XID 94: Contained ECC error > > XID 95: UnContained ECC error > > > > For Xid 94, these errors are contained to one application, and the application > > that encountered this error must be restarted. All other applications running > > at the time of the Xid are unaffected. It is recommended to reset the GPU when > > convenient. Applications can continue to be run until the reset can be > > performed. > > > > For Xid 95, these errors affect multiple applications, and the affected GPU > > must be reset before applications can restart. > > > > https://docs.nvidia.com/deploy/xid-errors/ > > Does AMD GPU provide a similar way to achieve error isolation requirement? > > Best Regards, > Shuai > > > > >> However, this parameter is > >> read-only, necessitating correct settings at driver load. And reloading the > >> GPU driver in a production environment can be challenging due to reference > >> counts maintained by various monitoring services. > >> > >> Set the gpu_recovery parameter with read-write permission to enable runtime > >> modification. It will enables users to dynamically manage GPU recovery > >> mechanisms based on real-time requirements or conditions. > >> > >> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++- > >> 1 file changed, 25 insertions(+), 1 deletion(-) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >> index 38686203bea6..03dd902e1cec 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > >> @@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444); > >> MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)"); > >> module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444); > >> +static int amdgpu_set_gpu_recovery(const char *buf, > >> + const struct kernel_param *kp) > >> +{ > >> + unsigned long val; > >> + int ret; > >> + > >> + ret = kstrtol(buf, 10, &val); > >> + if (ret < 0) > >> + return ret; > >> + > >> + if (val != 1 && val != 0 && val != -1) { > >> + pr_err("Invalid value for gpu_recovery: %ld, excepted 0,1,-1\n", > >> + val); > >> + return -EINVAL; > >> + } > >> + > >> + return param_set_int(buf, kp); > >> +} > >> + > >> +static const struct kernel_param_ops amdgpu_gpu_recovery_ops = { > >> + .set = amdgpu_set_gpu_recovery, > >> + .get = param_get_int, > >> +}; > >> + > >> /** > >> * DOC: gpu_recovery (int) > >> * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV). > >> */ > >> MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)"); > >> -module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444); > >> +module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644); > >> /** > >> * DOC: emu_mode (int) -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch