On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@xxxxxxxxx> wrote:
>
> On Thu, 10 Mar 2022 11:56:41 -0800
> Rob Clark <robdclark@xxxxxxxxx> wrote:
>
> > For something like just notifying a compositor that a gpu crash
> > happened, perhaps drm_event is more suitable. See
> > virtio_gpu_fence_event_create() for an example of adding new event
> > types. Although maybe you want it to be an event which is not device
> > specific. This isn't so much of a debugging use-case as simply
> > notification.
>
> Hi,
>
> for this particular use case, are we now talking about the display
> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>
> If the former, I wasn't aware that display device crashes are a thing.
> How should a userspace display server react to those?
>
> If the latter, don't we have EGL extensions or Vulkan API already to
> deliver that?
>
> The above would be about device crashes that directly affect the
> display server. Is that the use case in mind here, or is it instead
> about notifying the display server that some application has caused a
> driver/hardware crash? If the latter, how should a display server react
> to that? Disconnect the application?
>
> Shashank, what is the actual use case you are developing this for?
>
> I've read all the emails here so far, and I don't recall seeing it
> explained.
>

The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.). Today, when the GPU
resets, most applications just continue as if nothing is wrong.
Meanwhile, the GPU has stopped accepting work until the apps re-init
their contexts, so all of their command submissions just get rejected.

> Btw. somewhat relatedly, there has been work aiming to allow
> graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> the various APIs should react towards userspace when a DRM device
> suddenly disappears.
> That seems to have some overlap here IMO.
>
> See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> which also has a couple pointers to EGL and Vulkan APIs.

The problem is that most applications don't use the GL or VK robustness
APIs. You could use something like that in the compositor, but those
APIs tend to be focused more on the application itself than on the
GPU in general, e.g., "is my context lost?" That is fine for
restarting your own context, but it doesn't really help if you want to
do something about another application (i.e., the likely guilty
app). Also, on dGPU at least, when you reset the GPU, vram is usually
lost (either because the memory controller is reset, or because vram is
zeroed on init due to ECC support), so even if you are not the guilty
process, in that case you'd need to re-init your context anyway.

Alex

>
>
> Thanks,
> pq