On Fri, Oct 18, 2024 at 11:23 AM Rodrigo Vivi <rodrigo.vivi@xxxxxxxxx> wrote:
>
> On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> > Hi Raag,
> >
> > On 30/09/2024 04:38, Raag Jadav wrote:
> > > Introduce device wedged event, which will notify userspace of wedged
> > > (hanged/unusable) state of the DRM device through a uevent. This is
> > > useful especially in cases where the device is no longer operating as
> > > expected even after a hardware reset and has become unrecoverable from
> > > driver context.
> > >
> > > Purpose of this implementation is to provide drivers a generic way to
> > > recover with the help of userspace intervention. Different drivers may
> > > have different ideas of a "wedged device" depending on their hardware
> > > implementation, and hence the vendor agnostic nature of the event.
> > > It is up to the drivers to decide when they see the need for recovery
> > > and how they want to recover from the available methods.
> > >
> > > Current implementation defines three recovery methods, out of which,
> > > drivers can choose to support any one or multiple of them. Preferred
> > > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > > Userspace consumers (sysadmin) can define udev rules to parse this event
> > > and take respective action to recover the device.
> > >
> > > =============== ==================================
> > > Recovery method Consumer expectations
> > > =============== ==================================
> > > rebind          unbind + rebind driver
> > > bus-reset       unbind + reset bus device + rebind
> > > reboot          reboot system
> > > =============== ==================================
> > >
> >
> > I proposed something similar in the past:
> > https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/
> >
> > The motivation was that amdgpu was getting stuck after every GPU reset, and
> > there was just a black screen. The uevent would then trigger a daemon to
> > reset the compositor and get things back together. As you can see in my
> > thread, the feature was blocked in favor of getting better overall GPU reset
> > from the kernel side.
> >
> > Which kinds of scenarios make i915/xe need userspace involvement? I tested
> > a bunch of resets in i915 but never managed to get the driver stuck.
>
> 2 scenarios:
>
> 1. Multiple levels of reset have failed and the device was declared wedged.
>    This is rare indeed, as the resets have improved a lot.
> 2. Debug case. We can boot the driver with an option to declare the device
>    wedged at any timeout, so the device can be debugged.
>
> >
> > For the bus-reset, amdgpu does that too, but it doesn't require userspace
> > intervention.
>
> How do you trigger that? What do you mean by bus reset?

I think Christian is just referring to a full adapter reset (as opposed
to a queue reset or something more fine-grained). The driver can reset the
device via MMIO or firmware, depending on the device. I think there are
also PCI helpers for things like PCI FLR.

Alex
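
For reference, the recovery table quoted above could be consumed by a
userspace handler roughly as sketched below. This is only an illustration
and not part of the patch: it assumes the uevent properties arrive as
KEY=VALUE lines (as a tool such as `udevadm monitor --property` prints
them), and the names RECOVERY_STEPS, parse_uevent() and recovery_plan() are
hypothetical. It only computes the list of steps; actually unbinding,
resetting or rebooting is left to the consumer.

```python
# Hypothetical mapping from the WEDGED=<method> value to the consumer
# expectations listed in the quoted recovery table.
RECOVERY_STEPS = {
    "rebind": ["unbind driver", "rebind driver"],
    "bus-reset": ["unbind driver", "reset bus device", "rebind driver"],
    "reboot": ["reboot system"],
}


def parse_uevent(payload):
    """Split a KEY=VALUE-per-line payload into a dict (assumed format)."""
    env = {}
    for line in payload.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that carry no KEY=VALUE pair
            env[key.strip()] = value.strip()
    return env


def recovery_plan(env):
    """Return the ordered steps for the WEDGED method, or [] if none applies."""
    method = env.get("WEDGED")
    # Unknown or missing methods yield no steps, so nothing is done blindly.
    return list(RECOVERY_STEPS.get(method, []))


if __name__ == "__main__":
    sample = "ACTION=change\nSUBSYSTEM=drm\nWEDGED=bus-reset\n"
    for step in recovery_plan(parse_uevent(sample)):
        print(step)
```

Returning an empty plan for an unrecognised method is a deliberate choice in
this sketch, so that a consumer never reboots or rebinds on a value it does
not understand.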