On Thu, Nov 19, 2020 at 10:02:28AM -0500, Andrey Grodzovsky wrote:
>
> On 11/19/20 2:55 AM, Christian König wrote:
> > Am 18.11.20 um 17:20 schrieb Andrey Grodzovsky:
> > >
> > > On 11/18/20 7:01 AM, Christian König wrote:
> > > > Am 18.11.20 um 08:39 schrieb Daniel Vetter:
> > > > > On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
> > > > > <Andrey.Grodzovsky@xxxxxxx> wrote:
> > > > > >
> > > > > > On 11/17/20 2:49 PM, Daniel Vetter wrote:
> > > > > > > On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
> > > > > > > > On 11/17/20 1:52 PM, Daniel Vetter wrote:
> > > > > > > > > On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
> > > > > > > > > > On 6/22/20 5:53 AM, Daniel Vetter wrote:
> > > > > > > > > > > On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> > > > > > > > > > > > No point to try recovery if device is gone, just messes up things.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
> > > > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 ++++++++
> > > > > > > > > > > >  2 files changed, 24 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > index 6932d75..5d6d3d9 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
> > > > > > > > > > > >  	return ret;
> > > > > > > > > > > >  }
> > > > > > > > > > > > +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	int i;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > > > > > > > > +		struct amdgpu_ring *ring = adev->rings[i];
> > > > > > > > > > > > +
> > > > > > > > > > > > +		if (!ring || !ring->sched.thread)
> > > > > > > > > > > > +			continue;
> > > > > > > > > > > > +
> > > > > > > > > > > > +		cancel_delayed_work_sync(&ring->sched.work_tdr);
> > > > > > > > > > > > +	}
> > > > > > > > > > > > +}
> > > > > > > > > > > I think this is a function that's supposed to be in drm/scheduler, not
> > > > > > > > > > > here. Might also just be your cleanup code being ordered wrongly, or
> > > > > > > > > > > your split in one of the earlier patches not done quite right.
> > > > > > > > > > > -Daniel
> > > > > > > > > > This function iterates across all the schedulers per amdgpu device and
> > > > > > > > > > accesses amdgpu-specific structures; drm/scheduler deals with a single
> > > > > > > > > > scheduler at most, so this looks to me like the right place for this
> > > > > > > > > > function.
> > > > > > > > > I guess we could keep track of all schedulers somewhere in a list in
> > > > > > > > > struct drm_device and wrap this up. That was kinda the idea.
> > > > > > > > >
> > > > > > > > > Minimally I think a tiny wrapper with docs for the
> > > > > > > > > cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
> > > > > > > > > observe to make sure there's no race.
> > > > > > > > Will do
> > > > > > > > > I'm not exactly sure there's a guarantee here that we won't get a new
> > > > > > > > > tdr work launched right afterwards at least, so this looks a bit like
> > > > > > > > > a hack.
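For concreteness, the tiny documented wrapper suggested here might look something like this. This is only a sketch: the name drm_sched_cancel_tdr is made up for illustration, and the kerneldoc comment carries the caveat Daniel is asking to have spelled out.

```c
/**
 * drm_sched_cancel_tdr - cancel pending timeout (TDR) work (sketch)
 * @sched: scheduler whose timeout work should be cancelled
 *
 * Cancels &drm_gpu_scheduler.work_tdr and waits for an already running
 * handler to finish.  Note that this alone does not guarantee that no new
 * timeout work gets queued afterwards: the caller must first make sure no
 * new jobs can arm the timeout, e.g. by stopping the scheduler thread.
 */
static void drm_sched_cancel_tdr(struct drm_gpu_scheduler *sched)
{
	cancel_delayed_work_sync(&sched->work_tdr);
}
```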
> > > > > > > > Note that any TDR work happening after amdgpu_cancel_all_tdr will see
> > > > > > > > amdgpu_job_timedout->drm_dev_is_unplugged return true and so will
> > > > > > > > return early. To make it watertight against the race I can switch
> > > > > > > > from drm_dev_is_unplugged to drm_dev_enter/exit.
> > > > > > > Hm that's confusing. You do a work_cancel_sync, so that at least looks
> > > > > > > like "tdr work must not run after this point".
> > > > > > >
> > > > > > > If you only rely on the drm_dev_enter/exit check within the tdr work,
> > > > > > > then there's no need to cancel anything.
> > > > > >
> > > > > > Agree, synchronize_srcu from drm_dev_unplug should play the role of
> > > > > > 'flushing' any earlier (in-progress) tdr work which is using a
> > > > > > drm_dev_enter/exit pair. Any later-arising tdr will terminate early
> > > > > > when drm_dev_enter returns false.
> > > > > Nope, anything you put into the work itself cannot close this race.
> > > > > It's the schedule_work that matters here. Or I'm missing something ...
> > > > > I thought that the tdr work you're cancelling here is launched by
> > > > > drm/scheduler code, not by the amd callback?
> > >
> > > My bad, you are right, I am supposed to put the drm_dev_enter/exit pair
> > > into drm_sched_job_timedout.
> > >
> > > > Yes that is correct. Canceling the work item is not the right approach
> > > > at all, nor is adding a dev_enter/exit pair in the recovery handler.
> > >
> > > Without the dev_enter/exit guarding pair in the recovery handler you end
> > > up with a GPU reset starting while the device is already unplugged, which
> > > leads to multiple errors and general mess.
> > >
> > > > What we need to do here is to stop the scheduler thread and then wait
> > > > for any timeout handling to have finished.
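The drm_dev_enter/exit guard being proposed for drm_sched_job_timedout would look roughly like the sketch below. Assumptions: the scheduler gains a pointer to its struct drm_device (shown here as a hypothetical sched->dev field, which drm_gpu_scheduler did not have at the time), and all actual recovery logic is elided.

```c
static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	int idx;

	/* If drm_dev_unplug() already ran, bail out without touching the
	 * hardware and, crucially, without rearming the timeout timer.
	 * A handler that got past this check instead holds the SRCU read
	 * lock, so drm_dev_unplug()'s synchronize_srcu() waits for it to
	 * finish before teardown proceeds. */
	if (!drm_dev_enter(sched->dev, &idx))
		return;

	/* ... look up the hanging job and call
	 * sched->ops->timedout_job(job) here ... */

	drm_dev_exit(idx);
}
```

This is why the check inside the work closes the use-of-unplugged-device window but, as Daniel notes, not the "new work queued afterwards" race by itself.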
> > > > Otherwise it can schedule a new timeout just after we have canceled
> > > > this one.
> > > >
> > > > Regards,
> > > > Christian.
> > >
> > > Schedulers are stopped from amdgpu_driver_unload_kms, which indeed
> > > happens after drm_dev_unplug, so yes, there is still a chance for new
> > > work being scheduled and a timeout armed after that. But once I fix the
> > > code to place the drm_dev_enter/exit pair into drm_sched_job_timedout,
> > > I don't see why that is not a good solution?
> >
> > Yeah that should work as well, but then you also don't need to cancel
> > the work item from the driver.
>
> Indeed, as Daniel pointed out there is no need, and I dropped it. One
> correction: I previously said that without the dev_enter/exit guarding
> pair in the scheduler's TO handler you would get a GPU reset starting
> while the device is already gone. Of course this does not fully prevent
> that, as the device can be extracted at any moment just after we have
> already entered GPU recovery. But it does save us from processing a
> futile GPU recovery, which always starts once you unplug the device if
> there are active jobs in progress at the moment, so I think it's still
> justifiable to keep the dev_enter/exit guarding pair there.

Yeah sprinkling drm_dev_enter/exit over the usual suspect code paths like
tdr to make the entire unloading much faster makes sense. Waiting for
enormous amounts of mmio ops to time out isn't fun. A comment might be
good for that though, to explain why we're doing that.
-Daniel

> Andrey
>
> > > Any tdr work started after drm_dev_unplug finishes will simply abort
> > > on entry to drm_sched_job_timedout, because drm_dev_enter will return
> > > false and the function will return without rearming the timeout timer,
> > > and so will have no impact.
> > > The only issue I see here now is a possible use-after-free if some
> > > late tdr work tries to execute after the drm device is already gone.
> > > For this we should probably add cancel_delayed_work_sync(&sched->work_tdr)
> > > to drm_sched_fini, after sched->thread is stopped there.
> >
> > Good point, that is indeed missing as far as I can see.
> >
> > Christian.
> >
> > > Andrey

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
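For reference, the drm_sched_fini change discussed above (a sync cancel of work_tdr once the scheduler thread is stopped) would be on the order of this sketch; the remaining teardown of drm_sched_fini is elided.

```c
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
	if (sched->thread)
		kthread_stop(sched->thread);

	/* With the scheduler thread stopped nothing can arm work_tdr any
	 * more, so a sync cancel here closes the late use-after-free
	 * window against device removal. */
	cancel_delayed_work_sync(&sched->work_tdr);

	/* ... remaining teardown ... */
}
```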