On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> Hi Raag,
>
> On 2/28/25 11:20, Raag Jadav wrote:
> > Cc: Lucas
> >
> > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > When a device get wedged, it might be caused by a guilty application.
> > > For userspace, knowing which app was the cause can be useful for some
> > > situations, like for implementing a policy, logs or for giving a chance
> > > for the compositor to let the user know what app caused the problem.
> > > This is an optional argument, when `PID=-1` there's no information about
> > > the app caused the problem, or if any app was involved during the hang.
> > >
> > > Sometimes just the PID isn't enough giving that the app might be already
> > > dead by the time userspace will try to check what was this PID's name,
> > > so to make the life easier also notify what's the app's name in the user
> > > event.
> > >
> > > Signed-off-by: André Almeida <andrealmeid@xxxxxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
> > >  drivers/gpu/drm/drm_drv.c                  | 16 +++++++++++++---
> > >  drivers/gpu/drm/i915/gt/intel_reset.c      |  3 ++-
> > >  drivers/gpu/drm/xe/xe_device.c             |  3 ++-
> > >  include/drm/drm_device.h                   |  8 ++++++++
> > >  include/drm/drm_drv.h                      |  3 ++-
> > >  7 files changed, 29 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 24ba52d76045..00b9b87dafd8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -6124,7 +6124,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > >  	atomic_set(&adev->reset_domain->reset_res, r);
> > >  	if (!r)
> > > -		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >  	return r;
> > >  }
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index ef1b77f1e88f..3ed9cbcab1ad 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -150,7 +150,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> > >  			amdgpu_fence_driver_force_completion(ring);
> > >  			if (amdgpu_ring_sched_ready(ring))
> > >  				drm_sched_start(&ring->sched, 0);
> > > -			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
> > > +			drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
> > >  			dev_err(adev->dev, "Ring %s reset succeeded\n", ring->sched.name);
> > >  			goto exit;
> > >  		}
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index 17fc5dc708f4..48faafd82a99 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -522,6 +522,7 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >   * drm_dev_wedged_event - generate a device wedged uevent
> > >   * @dev: DRM device
> > >   * @method: method(s) to be used for recovery
> > > + * @info: optional information about the guilty app
> > >   *
> > >   * This generates a device wedged uevent for the DRM device specified by @dev.
> > >   * Recovery @method\(s) of choice will be sent in the uevent environment as
> > > @@ -534,13 +535,14 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > >   *
> > >   * Returns: 0 on success, negative error code otherwise.
> > >   */
> > > -int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > > +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method,
> > > +			 struct drm_wedge_app_info *info)
> > >  {
> > >  	const char *recovery = NULL;
> > >  	unsigned int len, opt;
> > >  	/* Event string length up to 28+ characters with available methods */
> > > -	char event_string[32];
> > > -	char *envp[] = { event_string, NULL };
> > > +	char event_string[32], pid_string[15], comm_string[TASK_COMM_LEN];
> > > +	char *envp[] = { event_string, pid_string, comm_string, NULL };
> > >  	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> > >  	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
> > >  		 "but recovered through reset" : "needs recovery");
> > > +	if (info) {
> > > +		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
> > > +		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
> > > +	} else {
> > > +		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
> > > +		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
> > > +	}
> >
> > This is not much use for wedge cases that needs recovery, since at that point
> > the userspace will need to clean house anyway.
> >
> > Which leaves us with only 'none' case and perhaps the need for standardization
> > of "optional telemetry collection".
> >
> > Thoughts?
>
> I had the feeling that 'none' was already meant to be used for that. Do you
> think we should move to another naming? Given that we didn't reach the merge
> window yet we could potentially change that name without much damage.

No, I meant thoughts on possible telemetry data that the drivers might think
is useful for userspace (along with PID) and can be presented in a vendor
agnostic manner (just like wedged event).

Raag
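
For reference, a minimal sketch of how a userspace consumer might read the
environment proposed in the quoted patch. The key names (WEDGED=, PID=, APP=)
and the "PID=-1" / "APP=none" placeholders are taken from the patch; the
struct, the helper names and the 16-byte name limit (assumed to match
TASK_COMM_LEN) are hypothetical and not part of any existing API. PID is
parsed as a signed value because the placeholder is -1.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: mirrors the kernel's TASK_COMM_LEN (including the NUL). */
#define COMM_MAX 16

/* Hypothetical userspace-side view of one wedged uevent. */
struct wedge_report {
	char recovery[32];   /* value of WEDGED=, e.g. "none" */
	long pid;            /* value of PID=, or -1 if no app was identified */
	char app[COMM_MAX];  /* value of APP=, or "none" */
};

/* Copy src into dst (capacity n), truncating and always NUL-terminating. */
static void copy_value(char *dst, size_t n, const char *src)
{
	snprintf(dst, n, "%s", src);
}

/*
 * Parse a NULL-terminated array of "KEY=value" strings.
 * Missing PID=/APP= keys fall back to the patch's placeholders.
 * Returns 0 if a WEDGED= key was present, -1 otherwise.
 */
static int parse_wedge_env(const char *const *envp, struct wedge_report *out)
{
	int seen_wedged = 0;

	out->recovery[0] = '\0';
	out->pid = -1;
	copy_value(out->app, sizeof(out->app), "none");

	for (; *envp; envp++) {
		if (strncmp(*envp, "WEDGED=", 7) == 0) {
			copy_value(out->recovery, sizeof(out->recovery), *envp + 7);
			seen_wedged = 1;
		} else if (strncmp(*envp, "PID=", 4) == 0) {
			char *end;
			long v = strtol(*envp + 4, &end, 10);

			/* Accept only a fully numeric, non-empty value. */
			if (end != *envp + 4 && *end == '\0')
				out->pid = v;
		} else if (strncmp(*envp, "APP=", 4) == 0) {
			copy_value(out->app, sizeof(out->app), *envp + 4);
		}
	}
	return seen_wedged ? 0 : -1;
}
```

A consumer would feed this the KEY=value pairs it gets from its uevent
source and treat pid == -1 as "no guilty app reported".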