Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Christian König <christian.koenig@xxxxxxx> · Thu, 8 Apr 2021 20:58:33 +0200

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
On 2021-04-08 4:32 a.m., Christian König wrote:
Am 08.04.21 um 10:22 schrieb Christian König:
[SNIP]

Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug OR 
in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
We also had same issue there, how to prevent MMIO accesses while 
the BARs are migrating. What was done there is a refcount was 
added to count all IOCTLs in flight, for any in flight  IOCTL the 
BAR migration handler would
block for the refcount to drop to 0 before it would proceed, for 
any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, we 
just return with -ENODEV for this case.

This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we can 
signal the fence and don't need to block memory management.

To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?

Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.

Thinking more about it - assuming we are  treating synchronize_srcu as 
a 'wait for completion' of any in flight {drm_dev_enter, drm_dev_exit} 
scope, some of those scopes might do dma_fence_wait inside. Since we 
haven't force signaled the fences yet we will end up a deadlock. We 
have to signal all the various fences before doing the 'wait for'. But 
we can't signal the fences before setting 'dev->unplugged = true' to 
reject further CS and other stuff which might create more fences we 
were supposed-to force signal and now missed them. Effectively setting 
'dev->unplugged = true' and doing synchronize_srcu in one call like 
drm_dev_unplug does without signalling all the fences in the device in 
between these two steps looks luck a possible deadlock to me - what do 
you think ?

Indeed, that is a really good argument to handle it the same way as the 
reset lock.

E.g. not taking it at the high level IOCTL, but rather when the frontend 
of the driver has acquired all the necessary locks (BO resv, VM lock 
etc...) before calling into the backend to actually do things with the 
hardware.

Christian.

Andrey

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx