It's observed that most GPU jobs utilize less than one server, typically
with each GPU being used by an independent job. If a job consumes poisoned
data, a SIGBUS signal is sent to terminate it. Meanwhile, because the
gpu_recovery parameter is set to -1 by default, the amdgpu driver resets
all GPUs on the server. As a result, all jobs are terminated. Setting
gpu_recovery to 0 provides an opportunity to preemptively evacuate other
jobs and subsequently reset all GPUs manually. However, this parameter is
read-only, so it must be set correctly at driver load. Reloading the GPU
driver in a production environment can be challenging due to reference
counts held by various monitoring services.

Make the gpu_recovery parameter read-write to enable runtime
modification. This enables users to dynamically manage the GPU recovery
mechanism based on real-time requirements or conditions.

Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902e1cec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+static int amdgpu_set_gpu_recovery(const char *buf,
+				   const struct kernel_param *kp)
+{
+	long val;	/* kstrtol() takes a long *; -1 must stay signed */
+	int ret;
+
+	ret = kstrtol(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	if (val != 1 && val != 0 && val != -1) {
+		pr_err("Invalid value for gpu_recovery: %ld, expected 0,1,-1\n",
+		       val);
+		return -EINVAL;
+	}
+
+	return param_set_int(buf, kp);
+}
+
+static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
+	.set = amdgpu_set_gpu_recovery,
+	.get = param_get_int,
+};
+
 /**
  * DOC: gpu_recovery (int)
  * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
  */
 MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
-module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
 
 /**
  * DOC: emu_mode (int)
-- 
2.39.3
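
For illustration only, here is a minimal userspace sketch of how a tool might
change the parameter at runtime once it is writable. It is not part of the
patch. The helper names gpu_recovery_value_ok() and write_gpu_recovery() are
hypothetical, and the check only roughly mirrors the kernel-side setter
(strtol() skips leading whitespace, which kstrtol() does not).

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical userspace mirror of the check in amdgpu_set_gpu_recovery():
 * accept only -1, 0 or 1, written as a base-10 integer.
 * Returns 0 and stores the value in *out, or -EINVAL.
 */
static int gpu_recovery_value_ok(const char *buf, long *out)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;

	/* Tolerate one trailing newline, as "echo 0 > file" would produce. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val != -1 && val != 0 && val != 1)
		return -EINVAL;

	*out = val;
	return 0;
}

/*
 * Hypothetical helper: validate the value, then write it to the given
 * parameter file. The caller supplies the path, so nothing here assumes
 * a particular system layout.
 */
static int write_gpu_recovery(const char *path, const char *value)
{
	long val;
	FILE *f;
	int ret = 0;

	if (gpu_recovery_value_ok(value, &val) < 0)
		return -EINVAL;

	f = fopen(path, "w");
	if (!f)
		return -errno;

	if (fprintf(f, "%ld\n", val) < 0)
		ret = -EIO;
	/* Buffered data reaches the kernel at fclose(), so check it too. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;

	return ret;
}
```

With the 0644 permission from this patch, the parameter would be expected at
/sys/module/amdgpu/parameters/gpu_recovery, so a call such as
write_gpu_recovery("/sys/module/amdgpu/parameters/gpu_recovery", "0") run as
root would disable recovery before evacuating jobs. Validating in userspace
first is a convenience; the kernel setter remains the authority and rejects
anything else with -EINVAL.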