Re: [PATCH v7 1/5] drm: Introduce device wedged event

Raag Jadav <raag.jadav@xxxxxxxxx> · Sat, 19 Oct 2024 22:08:45 +0300

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 30/09/2024 04:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >      =============== ==================================
> >      Recovery method Consumer expectations
> >      =============== ==================================
> >      rebind          unbind + rebind driver
> >      bus-reset       unbind + reset bus device + rebind
> >      reboot          reboot system
> >      =============== ==================================
> > 
> > 
> 
> I proposed something similar in the past:
> https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/

Thanks for sharing. I went through it and I think we can use some of the ideas
with some generic adaptation.

While we can always execute scripts on uevent, it'd be good to have a userspace
daemon applying automated policies for wedge cases based on admin/user needs.
This way we can also manage repeat offenders.
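
A rough sketch of what the policy core of such a daemon could look like, for
illustration only. It assumes the daemon receives raw kernel uevents
(NUL-separated "action@devpath" header followed by KEY=VALUE pairs) and that
the event carries WEDGED=<method> as described in the patch. The WedgePolicy
class, its max_same_method knob and the "escalate one step on repeats" rule
are hypothetical and not part of this series; whether userspace should ever
go beyond the driver's preferred method is exactly the kind of thing the
admin would configure.

```python
from collections import defaultdict

# Recovery methods from the table in the patch, least to most disruptive.
METHODS = ["rebind", "bus-reset", "reboot"]


def parse_uevent(payload: bytes) -> dict:
    """Parse a raw kernel uevent (NUL-separated fields) into a dict."""
    fields = [f for f in payload.decode("utf-8", "replace").split("\0") if f]
    env = {}
    for field in fields[1:]:  # fields[0] is the "action@devpath" header
        key, sep, value = field.partition("=")
        if sep:
            env[key] = value
    return env


class WedgePolicy:
    """Hypothetical policy: escalate one step for repeat offenders."""

    def __init__(self, max_same_method=2):
        self.max_same_method = max_same_method
        self.counts = defaultdict(int)  # DEVPATH -> wedged events seen so far

    def decide(self, env):
        method = env.get("WEDGED")
        if method not in METHODS:
            return None  # not a wedged event, or a method we do not know
        dev = env.get("DEVPATH", "")
        self.counts[dev] += 1
        # Move one step up the list for every max_same_method repeats.
        bump = (self.counts[dev] - 1) // self.max_same_method
        idx = min(METHODS.index(method) + bump, len(METHODS) - 1)
        return METHODS[idx]
```

The daemon would then map the returned string to the actual unbind/rebind,
bus reset or reboot action. Keeping the per-device count in one place is what
makes repeat offenders visible, which a one-shot udev rule cannot do.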

Xe has devcoredump so telemetry would also be a nice addition.

Great opportunity to collaborate here.

> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and get things back together. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.

We have hardware-level resets but (although rarely) they're also prone to failure.
We do what we can to recover from driver context but it adds to the complexity
over time. Something like wedging, if done right, would be much more robust IMHO.

Raag