Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

Christian König <christian.koenig@xxxxxxx> · Tue, 15 Mar 2022 08:25:48 +0100

On 15.03.22 at 08:13, Dave Airlie wrote:
On Tue, 15 Mar 2022 at 00:23, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote:
On Thu, 10 Mar 2022 11:56:41 -0800
Rob Clark <robdclark@xxxxxxxxx> wrote:

For something like just notifying a compositor that a gpu crash
happened, perhaps drm_event is more suitable.  See
virtio_gpu_fence_event_create() for an example of adding new event
types.  Although maybe you want it to be an event which is not device
specific.  This isn't so much of a debugging use-case as simply
notification.
Hi,

for this particular use case, are we now talking about the display
device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?

If the former, I wasn't aware that display device crashes are a thing.
How should a userspace display server react to those?

If the latter, don't we have EGL extensions or Vulkan API already to
deliver that?

The above would be about device crashes that directly affect the
display server. Is that the use case in mind here, or is it instead
about notifying the display server that some application has caused a
driver/hardware crash? If the latter, how should a display server react
to that? Disconnect the application?

Shashank, what is the actual use case you are developing this for?

I've read all the emails here so far, and I don't recall seeing it
explained.

The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.).  Today when the GPU
resets, most applications just continue assuming nothing is wrong,
meanwhile the GPU has stopped accepting work until the apps re-init
their context so all of their command submissions just get rejected.
Just one thing comes to mind reading this: racy PID reuse.

process 1234 does something bad to the GPU.
process 1234 dies in parallel with the sysfs notification being sent.
another process reuses PID 1234.
the new process 1234 gets destroyed by the receiver of the sysfs notification.

That's a well-known problem inherent to the use of PIDs.

IIRC, because of this the kernel only reuses a PID once the allocator
has reached /proc/sys/kernel/pid_max and wrapped around.
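
To illustrate the wraparound behaviour, here is a minimal sketch in Python.
It is a toy model, not kernel code: the class CyclicPidAllocator and the
helper allocations_until_reuse are hypothetical names, and the model assumes
PIDs are handed out sequentially, skip values still in use, and wrap back to
a first_pid once pid_max (treated as an exclusive bound) is reached.

```python
class CyclicPidAllocator:
    """Toy model of sequential PID allocation with wraparound (not kernel code)."""

    def __init__(self, pid_max, first_pid=1):
        self.pid_max = pid_max      # exclusive upper bound, assumed to behave like pid_max
        self.first_pid = first_pid  # value the counter wraps back to (an assumption)
        self.next_pid = first_pid
        self.in_use = set()

    def alloc(self):
        """Return the next free PID, scanning forward and wrapping at pid_max."""
        span = self.pid_max - self.first_pid
        for _ in range(span):
            pid = self.next_pid
            self.next_pid += 1
            if self.next_pid >= self.pid_max:
                self.next_pid = self.first_pid
            if pid not in self.in_use:
                self.in_use.add(pid)
                return pid
        raise RuntimeError("no free PIDs")

    def free(self, pid):
        """Mark a PID as available again (the process exited)."""
        self.in_use.discard(pid)


def allocations_until_reuse(allocator, victim):
    """Free `victim`, then count short-lived process starts until its PID reappears."""
    allocator.free(victim)
    count = 0
    while True:
        pid = allocator.alloc()
        count += 1
        if pid == victim:
            return count
        allocator.free(pid)  # each intermediate process exits immediately
```

In this model a freed PID only comes back after the counter has gone all the
way around, so the race window shrinks as pid_max grows, but it never closes
completely: a receiver that acts on a bare PID can still hit a recycled one.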

Regards,
Christian.

Dave.