Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

On 11/19/20 10:29 AM, Daniel Vetter wrote:
On Thu, Nov 19, 2020 at 10:02:28AM -0500, Andrey Grodzovsky wrote:
On 11/19/20 2:55 AM, Christian König wrote:
On 18.11.20 at 17:20, Andrey Grodzovsky wrote:
On 11/18/20 7:01 AM, Christian König wrote:
On 18.11.20 at 08:39, Daniel Vetter wrote:
On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> wrote:
On 11/17/20 2:49 PM, Daniel Vetter wrote:
On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
On 11/17/20 1:52 PM, Daniel Vetter wrote:
On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
On 6/22/20 5:53 AM, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++++
      2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
           return ret;
      }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+        int i;
+
+        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+                struct amdgpu_ring *ring = adev->rings[i];
+
+                if (!ring || !ring->sched.thread)
+                        continue;
+
+                cancel_delayed_work_sync(&ring->sched.work_tdr);
+        }
+}
I think this is a function that's supposed to be in drm/scheduler, not here. Might also just be your cleanup code being ordered wrongly, or your split in one of the earlier patches not done quite right.
-Daniel
This function iterates across all the schedulers per amdgpu device and accesses amdgpu-specific structures; drm/scheduler deals with a single scheduler at most, so this looks to me like the right place for this function.
I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.
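Something as small as this would do - the name drm_sched_cancel_pending_timeout is made up here purely for illustration, the point is the kerneldoc spelling out what callers must guarantee:

/**
 * drm_sched_cancel_pending_timeout - cancel a queued/running TDR handler
 * @sched: scheduler instance
 *
 * Cancels a pending timeout handler and waits for a running one to finish.
 * Callers must make sure the scheduler can no longer queue new timeouts
 * (e.g. the scheduler thread is stopped and no new job can complete)
 * before relying on this, otherwise work_tdr can simply be re-armed right
 * after this returns and the cancel races.
 */
static inline void
drm_sched_cancel_pending_timeout(struct drm_gpu_scheduler *sched)
{
	cancel_delayed_work_sync(&sched->work_tdr);
}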
Will do


I'm not exactly sure there's any guarantee here that we won't get a new tdr work launched right afterwards at least, so this looks a bit like a hack.
Note that for any TDR work happening post amdgpu_cancel_all_tdr, amdgpu_job_timedout->drm_dev_is_unplugged will return true and so it will return early. To make it watertight against the race I can switch from drm_dev_is_unplugged to drm_dev_enter/exit.
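Roughly like this in the driver's timedout callback (untested sketch, just to show the idea; the actual hang detection/recovery code stays where it is):

static void amdgpu_job_timedout(struct drm_sched_job *s_job)
{
	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
	int idx;

	/* Fails once drm_dev_unplug() was called, so a late TDR bails out
	 * instead of touching hardware that is already gone. */
	if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
		return;

	/* ... existing hang check and amdgpu_device_gpu_recover() ... */

	drm_dev_exit(idx);
}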
Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.
Agree, synchronize_srcu from drm_dev_unplug should play the role of 'flushing' any earlier (in-progress) tdr work which is using the drm_dev_enter/exit pair. Any later-arising tdr will terminate early when drm_dev_enter returns false.
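For reference, drm_dev_unplug is essentially this (quoting from memory, not verbatim):

void drm_dev_unplug(struct drm_device *dev)
{
	/*
	 * After synchronizing, any critical read section is guaranteed to
	 * see the new value of ->unplugged, and any critical section that
	 * might still have seen the old value is guaranteed to have
	 * finished - that is the 'flushing' of in-flight
	 * drm_dev_enter/exit sections relied on above.
	 */
	dev->unplugged = true;
	synchronize_srcu(&drm_unplug_srcu);

	drm_dev_unregister(dev);
}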
Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?

My bad, you are right, I am supposed to put the drm_dev_enter/exit pair into drm_sched_job_timedout.


Yes, that is correct. Canceling the work item is not the right approach at all, nor is adding a dev_enter/exit pair in the recovery handler.

Without adding the dev_enter/exit guarding pair in the recovery handler you end up with a GPU reset starting while the device is already unplugged; this leads to multiple errors and a general mess.


What we need to do here is to stop the scheduler thread and then
wait for any timeout handling to have finished.

Otherwise it can schedule a new timeout just after we have canceled this one.

Regards,
Christian.

Schedulers are stopped from amdgpu_driver_unload_kms, which indeed happens after drm_dev_unplug, so yes, there is still a chance for new work being scheduled and a timeout armed after that; but once I fix the code to place the drm_dev_enter/exit pair into drm_sched_job_timedout, I don't see why that's not a good solution?
Yeah that should work as well, but then you also don't need to cancel
the work item from the driver.

Indeed, as Daniel pointed out there is no need and I dropped it. One correction - I previously said that w/o the dev_enter/exit guarding pair in the scheduler's TO handler you will get a GPU reset starting while the device is already gone - of course this does not fully prevent that, as the device can be extracted at any moment just after we have already entered GPU recovery. But it does save us from processing a futile GPU recovery, which always starts once you unplug the device if there are active jobs in progress at the moment, and so I think it's still justifiable to keep the dev_enter/exit guarding pair there.
Yeah sprinkling drm_dev_enter/exit over the usual suspect code paths like
tdr to make the entire unloading much faster makes sense. Waiting for
enormous amounts of mmio ops to time out isn't fun. A comment might be
good for that though, to explain why we're doing that.
-Daniel


Will do. I also tried to insert drm_dev_enter/exit into all the MMIO accessors in amdgpu to try and handle this at that level, but didn't get good results for an unclear reason; I will probably get to this as follow-up work, again to avoid expanding the scope of the current work too much.

Andrey



Andrey



Any tdr work started after drm_dev_unplug has finished will simply abort on entry to drm_sched_job_timedout, because drm_dev_enter will return false and the function will return without rearming the timeout timer, and so will have no impact.
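Purely illustrative sketch of that - drm_gpu_scheduler has no drm_device pointer today, so assume a hypothetical sched->ddev backpointer just for this:

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	int idx;

	/* Device already unplugged: return without invoking the driver's
	 * timedout_job callback and without re-arming work_tdr.
	 * sched->ddev is hypothetical; the scheduler would need some way
	 * to reach its drm_device. */
	if (!drm_dev_enter(sched->ddev, &idx))
		return;

	/* ... existing job lookup and sched->ops->timedout_job(job) ... */

	/* re-arm the timer for the next job, as the code already does */
	drm_sched_start_timeout(sched);

	drm_dev_exit(idx);
}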

The only issue I see here now is a possible use-after-free if some late tdr work tries to execute after the drm device is already gone; for this we should probably add cancel_delayed_work_sync(&sched->work_tdr) to drm_sched_fini, after sched->thread is stopped there.
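i.e. something along these lines on top of the current drm_sched_fini (untested sketch):

void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
	if (sched->thread)
		kthread_stop(sched->thread);

	/* The scheduler thread is stopped, nothing can queue a new timeout
	 * anymore; flush out any TDR work still in flight so it cannot run
	 * after the scheduler/device memory is freed. */
	cancel_delayed_work_sync(&sched->work_tdr);

	sched->ready = false;
}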
Good point, that is indeed missing as far as I can see.

Christian.

Andrey
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



