On 2021-04-09 3:01 a.m., Christian König wrote:
On 09.04.21 at 08:53, Christian König wrote:
On 08.04.21 at 22:39, Andrey Grodzovsky wrote:
On 2021-04-08 2:58 p.m., Christian König wrote:
On 08.04.21 at 18:08, Andrey Grodzovsky wrote:
On 2021-04-08 4:32 a.m., Christian König wrote:
On 08.04.21 at 10:22, Christian König wrote:
[SNIP]
Beyond blocking all delayed works and scheduler threads, we
also need to guarantee that no IOCTL can access MMIO after device
unplug, or that in-flight IOCTLs are done before we finish
pci_remove (amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on
with Takashi Iwai, the ALSA maintainer, recently when he helped
implement PCI BAR move support for snd_hda_intel. Take a
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440
We had the same issue there: how to prevent MMIO accesses
while the BARs are migrating. What was done there is that a
refcount was added to count all IOCTLs in flight; for any in-flight
IOCTL the BAR migration handler would block until the refcount
dropped to 0 before it would proceed, and any later IOCTL would
stop and wait if the device was in the migration state. We don't
even need the wait part here - there is nothing to wait for, we
just return -ENODEV in this case.
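In rough kernel-C terms, the scheme is something like the sketch
below (names and the wait-queue mechanism are hypothetical, the real
implementation is in the commits above, and the small race between
checking the removal flag and taking the refcount is glossed over):

#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/wait.h>
#include <drm/drm_ioctl.h>

static atomic_t ioctls_in_flight = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(ioctl_drain_wq);
static bool device_removed;

/* Every IOCTL is counted; late IOCTLs bail out with -ENODEV. */
static long counted_ioctl(struct file *filp, unsigned int cmd,
			  unsigned long arg)
{
	long ret;

	if (READ_ONCE(device_removed))
		return -ENODEV;

	atomic_inc(&ioctls_in_flight);
	ret = drm_ioctl(filp, cmd, arg);
	if (atomic_dec_and_test(&ioctls_in_flight))
		wake_up(&ioctl_drain_wq);

	return ret;
}

/* The removal handler blocks until all in-flight IOCTLs have drained. */
static void wait_for_ioctls(void)
{
	WRITE_ONCE(device_removed, true);
	wait_event(ioctl_drain_wq, atomic_read(&ioctls_in_flight) == 0);
}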
This is essentially what the DRM SRCU is doing as well.
For the hotplug case we could do this at the top level, since we
can signal the fences and don't need to block memory management.
To make SRCU 'wait for' all IOCTLs in flight, we would need to
wrap every IOCTL (practically just the drm_ioctl function) with
drm_dev_enter/drm_dev_exit - can we do that?
Sorry, I totally missed this question.
Yes, exactly that. As discussed for the hotplug case we can do this.
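Something along these lines, I think (just a sketch; the real amdgpu
ioctl entry point also has to deal with runtime PM, which is omitted
here):

#include <drm/drm_device.h>
#include <drm/drm_drv.h>
#include <drm/drm_file.h>
#include <drm/drm_ioctl.h>

static long wrapped_drm_ioctl(struct file *filp, unsigned int cmd,
			      unsigned long arg)
{
	struct drm_file *file_priv = filp->private_data;
	struct drm_device *dev = file_priv->minor->dev;
	long ret;
	int idx;

	/* Enter the drm unplug SRCU read section; fail fast once unplugged. */
	if (!drm_dev_enter(dev, &idx))
		return -ENODEV;

	ret = drm_ioctl(filp, cmd, arg);

	/* Leaving the section lets drm_dev_unplug()'s synchronize_srcu()
	 * make progress. */
	drm_dev_exit(idx);
	return ret;
}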
Thinking more about it - assuming we are treating
synchronize_srcu as a 'wait for completion' of any in-flight
{drm_dev_enter, drm_dev_exit} scope, some of those scopes might do a
dma_fence_wait inside. Since we haven't force-signaled the fences
yet, we will end up in a deadlock. We have to signal all the various
fences before doing the 'wait for'. But we can't signal the fences
before setting 'dev->unplugged = true' to reject further CS and
other stuff which might create more fences that we were supposed to
force-signal and would now miss. Effectively, setting
'dev->unplugged = true' and doing synchronize_srcu in one call,
like drm_dev_unplug does, without signaling all the fences in the
device in between these two steps, looks like a possible deadlock
to me - what do you think?
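To illustrate the ordering hazard I mean (just an illustration, not
real driver code):

#include <linux/dma-fence.h>
#include <drm/drm_drv.h>

/* Thread A: an IOCTL inside a drm_dev_enter() section, waiting on a HW
 * fence that will never signal because the HW is already gone. */
static int ioctl_path(struct drm_device *dev, struct dma_fence *hw_fence)
{
	int idx;

	if (!drm_dev_enter(dev, &idx))
		return -ENODEV;

	/* Blocks forever: nothing will ever signal hw_fence. */
	dma_fence_wait(hw_fence, false);

	drm_dev_exit(idx);
	return 0;
}

/* Thread B: teardown. drm_dev_unplug() sets the unplugged state and then
 * does synchronize_srcu(), which waits for thread A to reach
 * drm_dev_exit() - which it never does. */
static void teardown_path(struct drm_device *dev)
{
	drm_dev_unplug(dev);
	/* Force-signaling the HW fences only here is too late:
	 * we are already deadlocked above. */
}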
Indeed, that is a really good argument to handle it the same way as
the reset lock.
E.g. not taking it at the high-level IOCTL, but rather when the
frontend of the driver has acquired all the necessary locks (BO
resv, VM lock, etc.) before calling into the backend to actually
do things with the hardware.
Christian.
From what you said I understand that you want to solve this problem
by using drm_dev_enter/exit brackets low enough in the code such
that they will not include any fence wait.
But inserting drm_dev_enter/exit at the highest level in drm_ioctl
is much less effort and leaves less room for error than going through
each IOCTL and trying to identify at what point (possibly multiple
points) it is about to access HW; some of this is hidden deep in
HAL layers such as the DC layer in the display driver or the multiple
layers of the powerplay/SMU libraries. Also, we can't limit ourselves
only to the back-end, if by this you mean ASIC-specific functions
which access registers. We also need to take care of any MMIO kernel
BOs (VRAM BOs) where we may access MMIO space directly by pointer
from the front end of the driver (HW-agnostic) and the TTM/DRM layers.
Exactly, yes. The key point is that we need to identify such places
anyway for GPU reset to work properly. So we could just piggyback
hotplug on top of that work and be done.
I see most of this was done by Denis in this patch
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=df9c8d1aa278c435c30a69b8f2418b4a52fcb929,
but indeed this doesn't cover the direct by-pointer accesses to MMIO,
and as people write new code, many more of those and new places to
cover will pop up, leading to regressions and extra work to fix. It
would be much better if we could blanket-cover it at the very top,
such as the root of all IOCTLs or, for any queued work/timer, at the
very top function, to handle it once and for all.
Our problem here is how to signal all the existing fences on one
hand, and on the other hand prevent any new dma_fence waits after we
have finished signaling the existing fences. Once we have solved
this, there is no problem using drm_dev_unplug in conjunction with
drm_dev_enter/exit at the highest level of drm_ioctl to flush any
IOCTLs in flight and block any new ones.
IMHO when we speak about signaling all fences we don't mean ALL the
currently existing dma_fence structs (they are spread all over the
place), but rather signaling all the HW fences, because the HW is
what's gone and we can't expect those fences to ever be signaled.
All the rest, such as scheduler fences, user fences, drm_gem
reservation objects, etc., either depend on those HW fences, and
hence signaling the HW fences will in turn signal them, or are not
impacted by the HW being gone and hence can still be waited on and
will complete. If this assumption is correct, then I think that we
should use some flag to prevent any new submission to HW which
creates HW fences (somewhere around amdgpu_fence_emit), then
traverse all existing HW fences (currently they are spread across a
few places, so maybe we need to track them in a list) and signal
them. After that it's safe to call drm_dev_unplug and be sure
synchronize_srcu won't stall because of a dma_fence_wait. After
that we can proceed to canceling work items, stopping schedulers, etc.
That is problematic as well, since you need to make sure that the
scheduler is not creating a new hardware fence at the moment you try
to signal all of them. It would require another SRCU or lock for this.
What if we use a list and a flag called 'emit_allowed' under a lock,
such that in amdgpu_fence_emit we lock the list, check the flag and,
if true, add the new HW fence to the list and proceed with HW emission
as normal, and otherwise return -ENODEV. In amdgpu_pci_remove we take
the lock, set the flag to false, and then iterate the list and
force-signal the fences. Will this not prevent any new HW fence
creation from now on, from any place trying to do so?
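Roughly something like this (just a sketch; the struct, field and
function names are hypothetical, not existing amdgpu code, and removal
of fences from the list on normal release is omitted):

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct hw_fence_tracker {
	spinlock_t lock;
	bool emit_allowed;
	struct list_head fences;
};

struct tracked_hw_fence {
	struct dma_fence base;
	struct list_head node;
};

/* Called on the emission path, somewhere around amdgpu_fence_emit(). */
static int hw_fence_track(struct hw_fence_tracker *t,
			  struct tracked_hw_fence *f)
{
	spin_lock(&t->lock);
	if (!t->emit_allowed) {
		spin_unlock(&t->lock);
		return -ENODEV;		/* device is being removed */
	}
	list_add_tail(&f->node, &t->fences);
	spin_unlock(&t->lock);
	return 0;			/* caller proceeds with HW emission */
}

/* Called from amdgpu_pci_remove(), before drm_dev_unplug(). */
static void hw_fences_force_signal(struct hw_fence_tracker *t)
{
	struct tracked_hw_fence *f;

	spin_lock(&t->lock);
	t->emit_allowed = false;	/* no new HW fences from here on */
	list_for_each_entry(f, &t->fences, node)
		dma_fence_signal(&f->base);
	spin_unlock(&t->lock);
}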
Alternatively grabbing the reset write side and stopping and then
restarting the scheduler could work as well.
Christian.
I didn't get the above, and I don't see why I need to reuse the GPU
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, it's
not clear to me why we are focusing on the scheduler threads; any code
path that can generate HW fences should be covered, so any code
leading to amdgpu_fence_emit needs to be taken into account, such as
direct IB submissions, VM flushes, etc.
Andrey
Christian.
Andrey
Andrey
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx