Re: [PATCH v7 1/5] drm: Introduce device wedged event

Hi Raag,

On 30/09/2024 04:38, Raag Jadav wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a hardware reset and has become unrecoverable from
> driver context.
>
> Purpose of this implementation is to provide drivers a generic way to
> recover with the help of userspace intervention. Different drivers may
> have different ideas of a "wedged device" depending on their hardware
> implementation, and hence the vendor agnostic nature of the event.
> It is up to the drivers to decide when they see the need for recovery
> and how they want to recover from the available methods.
>
> Current implementation defines three recovery methods, out of which,
> drivers can choose to support any one or multiple of them. Preferred
> recovery method will be sent in the uevent environment as WEDGED=<method>.
> Userspace consumers (sysadmin) can define udev rules to parse this event
> and take respective action to recover the device.
>
>      =============== ==================================
>      Recovery method Consumer expectations
>      =============== ==================================
>      rebind          unbind + rebind driver
>      bus-reset       unbind + reset bus device + rebind
>      reboot          reboot system
>      =============== ==================================
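
For illustration only, here is a rough sketch of how a userspace consumer might turn the WEDGED=<method> value into the steps listed in the table above. It is not taken from the patch: the names RECOVERY_STEPS, parse_uevent and plan_recovery are hypothetical, the sample payload is made up, and the code only returns a plan as strings rather than touching the device.

```python
# Hypothetical consumer-side sketch; none of these names come from the patch.
# It maps the WEDGED=<method> value from a uevent environment to the
# "consumer expectations" column of the table, without executing anything.

RECOVERY_STEPS = {
    "rebind": ["unbind driver", "rebind driver"],
    "bus-reset": ["unbind driver", "reset bus device", "rebind driver"],
    "reboot": ["reboot system"],
}


def parse_uevent(payload):
    """Parse KEY=VALUE entries (newline- or NUL-separated) into a dict."""
    env = {}
    for line in payload.replace("\0", "\n").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip entries that are not KEY=VALUE pairs
            env[key] = value
    return env


def plan_recovery(env):
    """Return the recovery steps for the event, or [] if not applicable."""
    method = env.get("WEDGED")
    if method is None:
        return []  # not a wedged event
    # Unknown methods yield an empty plan instead of a guess.
    return list(RECOVERY_STEPS.get(method, []))


if __name__ == "__main__":
    sample = "ACTION=change\nWEDGED=bus-reset"  # made-up payload
    for step in plan_recovery(parse_uevent(sample)):
        print(step)
```

A real consumer would more likely be a udev rule that matches on the WEDGED property and runs a helper script; the lookup table above just mirrors the three methods so that an unrecognised value is ignored rather than acted on.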



I proposed something similar in the past: https://lore.kernel.org/dri-devel/20221125175203.52481-1-andrealmeid@xxxxxxxxxx/

The motivation was that amdgpu was getting stuck after every GPU reset, leaving just a black screen. The uevent would then trigger a daemon to reset the compositor and get things working again. As you can see in my thread, the feature was blocked in favor of improving overall GPU reset handling on the kernel side.

What kinds of scenarios make i915/xe need userspace involvement? I tested a bunch of resets in i915 but never managed to get the driver stuck.

As for bus-reset, amdgpu does that too, but it doesn't require userspace intervention.


