Re: [PATCH v7 1/5] drm: Introduce device wedged event

On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
> 
> On 30/09/2024 04:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> > 
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> > 
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> > 
> >      =============== ==================================
> >      Recovery method Consumer expectations
> >      =============== ==================================
> >      rebind          unbind + rebind driver
> >      bus-reset       unbind + reset bus device + rebind
> >      reboot          reboot system
> >      =============== ==================================
> > 
> > 
> 
> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/
> 
> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and get things back together. As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.
> 
> Which scenarios make i915/xe need userspace involvement? I tested a bunch
> of resets in i915 but never managed to get the driver stuck.

Two scenarios:

1. Multiple levels of reset have failed and the device was declared wedged. This
   is rare indeed, as the resets have improved a lot.
2. Debug case. We can boot the driver with an option to declare the device wedged
   on any timeout, so the device can be debugged.

> 
> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> intervention.

How do you trigger that?
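
For reference, the consumer side of the recovery table quoted above could be
sketched roughly as below. This is only an illustration, not part of the patch:
the helper names (parse_uevent_env, recovery_plan) and the step strings are
hypothetical, and the only thing taken from the series is that the uevent
environment carries WEDGED=<method> with the three methods listed in the table.

```python
# Hypothetical userspace-side sketch: map the WEDGED=<method> value from a
# uevent environment to the consumer expectations listed in the table.
# Nothing here is an API from the patch; all names are illustrative.

RECOVERY_STEPS = {
    "rebind": ["unbind driver", "rebind driver"],
    "bus-reset": ["unbind driver", "reset bus device", "rebind driver"],
    "reboot": ["reboot system"],
}


def parse_uevent_env(lines):
    """Turn an iterable of KEY=VALUE strings into a dict, skipping other lines."""
    env = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:  # only keep entries that actually contain '='
            env[key.strip()] = value.strip()
    return env


def recovery_plan(env):
    """Return the ordered steps for the advertised method.

    Returns None when the event carries no WEDGED key, and an empty list
    when the method is not one of the three known ones.
    """
    method = env.get("WEDGED")
    if method is None:
        return None
    return list(RECOVERY_STEPS.get(method, []))


if __name__ == "__main__":
    sample = ["SUBSYSTEM=drm", "WEDGED=bus-reset"]
    for step in recovery_plan(parse_uevent_env(sample)):
        print(step)
```

A real consumer would more likely be a udev rule calling a small script. The
mapping is kept as a plain table here so that an unknown method results in no
action rather than a guess.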


