On Thu, Oct 17, 2024 at 04:16:09PM -0300, André Almeida wrote:
> Hi Raag,
>
> On 30/09/2024 04:38, Raag Jadav wrote:
> > Introduce device wedged event, which will notify userspace of wedged
> > (hanged/unusable) state of the DRM device through a uevent. This is
> > useful especially in cases where the device is no longer operating as
> > expected even after a hardware reset and has become unrecoverable from
> > driver context.
> >
> > Purpose of this implementation is to provide drivers a generic way to
> > recover with the help of userspace intervention. Different drivers may
> > have different ideas of a "wedged device" depending on their hardware
> > implementation, and hence the vendor agnostic nature of the event.
> > It is up to the drivers to decide when they see the need for recovery
> > and how they want to recover from the available methods.
> >
> > Current implementation defines three recovery methods, out of which,
> > drivers can choose to support any one or multiple of them. Preferred
> > recovery method will be sent in the uevent environment as WEDGED=<method>.
> > Userspace consumers (sysadmin) can define udev rules to parse this event
> > and take respective action to recover the device.
> >
> >     =============== ==================================
> >     Recovery method Consumer expectations
> >     =============== ==================================
> >     rebind          unbind + rebind driver
> >     bus-reset       unbind + reset bus device + rebind
> >     reboot          reboot system
> >     =============== ==================================
> >
> >
>
> I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/
>
> The motivation was that amdgpu was getting stuck after every GPU reset, and
> there was just a black screen. The uevent would then trigger a daemon to
> reset the compositor and getting things back together.
> As you can see in my
> thread, the feature was blocked in favor of getting better overall GPU reset
> from the kernel side.
>
> Which kind of scenarios are making i915/xe the need to have userspace
> involvement? I tested a bunch of resets in i915 but never managed to get the
> driver stuck.

Two scenarios:

1. Multiple levels of reset have failed and the device was declared wedged.
   This is rare indeed, as the resets have improved a lot.
2. Debug case. We can boot the driver with an option to declare the device
   wedged at any timeout, so that the device can be debugged.

>
> For the bus-reset, amdgpu does that too, but it doesn't require userspace
> intervention. How do you trigger that?