On 2020-11-18 07:01, Christian König wrote:
> Am 18.11.20 um 08:39 schrieb Daniel Vetter:
>> On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
>> <Andrey.Grodzovsky@xxxxxxx> wrote:
>>>
>>> On 11/17/20 2:49 PM, Daniel Vetter wrote:
>>>> On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
>>>>> On 11/17/20 1:52 PM, Daniel Vetter wrote:
>>>>>> On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
>>>>>>> On 6/22/20 5:53 AM, Daniel Vetter wrote:
>>>>>>>> On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
>>>>>>>>> No point to try recovery if device is gone, just messes up things.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 ++++++++++++++++
>>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 ++++++++
>>>>>>>>>   2 files changed, 24 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>> index 6932d75..5d6d3d9 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>>> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>>>>>>>>>  	return ret;
>>>>>>>>>  }
>>>>>>>>> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
>>>>>>>>> +{
>>>>>>>>> +	int i;
>>>>>>>>> +
>>>>>>>>> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>>> +		struct amdgpu_ring *ring = adev->rings[i];
>>>>>>>>> +
>>>>>>>>> +		if (!ring || !ring->sched.thread)
>>>>>>>>> +			continue;
>>>>>>>>> +
>>>>>>>>> +		cancel_delayed_work_sync(&ring->sched.work_tdr);
>>>>>>>>> +	}
>>>>>>>>> +}
>>>>>>>> I think this is a function that's supposed to be in drm/scheduler, not
>>>>>>>> here. Might also just be your cleanup code being ordered wrongly, or your
>>>>>>>> split in one of the earlier patches not done quite right.
>>>>>>>> -Daniel
>>>>>>> This function iterates across all the schedulers per amdgpu device and accesses
>>>>>>> amdgpu-specific structures; drm/scheduler deals with a single scheduler at most,
>>>>>>> so this looks to me like the right place for this function.
>>>>>> I guess we could keep track of all schedulers somewhere in a list in
>>>>>> struct drm_device and wrap this up. That was kinda the idea.
>>>>>>
>>>>>> Minimally I think a tiny wrapper with docs for the
>>>>>> cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
>>>>>> observe to make sure there's no race.
>>>>> Will do.
>>>>>
>>>>>> I'm not exactly sure there's no
>>>>>> guarantee here we won't get a new tdr work launched right afterwards at
>>>>>> least, so this looks a bit like a hack.
>>>>> Note that for any TDR work happening post amdgpu_cancel_all_tdr,
>>>>> amdgpu_job_timedout->drm_dev_is_unplugged will return true and so it
>>>>> will return early. To make it watertight against the race I can switch
>>>>> from drm_dev_is_unplugged to drm_dev_enter/exit.
>>>> Hm, that's confusing. You do a cancel_work_sync, so that at least looks
>>>> like "tdr work must not run after this point".
>>>>
>>>> If you only rely on the drm_dev_enter/exit check in the tdr work, then
>>>> there's no need to cancel anything.
>>>
>>> Agree, synchronize_srcu from drm_dev_unplug should play the role
>>> of 'flushing' any earlier (in-progress) tdr work which is using the
>>> drm_dev_enter/exit pair. Any later arising tdr will terminate early
>>> when drm_dev_enter returns false.
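
(As an aside: if one did go the drm_dev_enter/exit route, the guard in the
timeout handler would look roughly like the sketch below. It is untested and
purely illustrative; adev->ddev, ring->adev and the overall shape of the
function are simply lifted from the patch above.)

    static void amdgpu_job_timedout(struct drm_sched_job *s_job)
    {
            struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
            struct amdgpu_device *adev = ring->adev;
            int idx;

            /* Open an SRCU read-side section. drm_dev_unplug() marks the
             * device unplugged and then calls synchronize_srcu(), so it
             * waits for any handler that already got past this point, and
             * any handler entered after the unplug sees false here and
             * bails out.
             */
            if (!drm_dev_enter(adev->ddev, &idx)) {
                    DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
                             s_job->sched->name);
                    return;
            }

            /* ... soft recovery / amdgpu_device_gpu_recover() as before ... */

            drm_dev_exit(idx);
    }

That guard only makes a late TDR bail out early, though; by itself it does
nothing to stop drm/scheduler from queueing a new timeout in the first place,
which is what the rest of the thread is about.
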
>> Nope, anything you put into the work itself cannot close this race.
>> It's the schedule_work that matters here. Or I'm missing something ...
>> I thought that the tdr work you're cancelling here is launched by
>> drm/scheduler code, not by the amd callback?
>
> Yes, that is correct. Canceling the work item is not the right approach
> at all, nor is adding a dev_enter/exit pair in the recovery handler.
>
> What we need to do here is to stop the scheduler thread and then wait
> for any timeout handling to have finished.
>
> Otherwise it can schedule a new timeout just after we have canceled
> this one.

Yep, that's exactly what I said in my email above.

Regards,
Luben

>
> Regards,
> Christian.
>
>> -Daniel
>>
>>> Will update.
>>>
>>> Andrey
>>>
>>>> For race-free cancel_work_sync you need:
>>>> 1. make sure whatever is calling schedule_work is guaranteed to no longer
>>>> call schedule_work.
>>>> 2. call cancel_work_sync
>>>>
>>>> Anything else is cargo-culted work cleanup:
>>>>
>>>> - 1. without 2. means if a work got scheduled right before, it'll still be
>>>> a problem.
>>>> - 2. without 1. means a schedule_work right after makes calling
>>>> cancel_work_sync pointless.
>>>>
>>>> So either both or nothing.
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>> -Daniel
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>>> +
>>>>>>>>>  static void
>>>>>>>>>  amdgpu_pci_remove(struct pci_dev *pdev)
>>>>>>>>>  {
>>>>>>>>>  	struct drm_device *dev = pci_get_drvdata(pdev);
>>>>>>>>> +	struct amdgpu_device *adev = dev->dev_private;
>>>>>>>>>  	drm_dev_unplug(dev);
>>>>>>>>> +	amdgpu_cancel_all_tdr(adev);
>>>>>>>>>  	ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>>>>>>>>>  	amdgpu_driver_unload_kms(dev);
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>>>> index 4720718..87ff0c0 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>>>>>> @@ -28,6 +28,8 @@
>>>>>>>>>  #include "amdgpu.h"
>>>>>>>>>  #include "amdgpu_trace.h"
>>>>>>>>> +#include <drm/drm_drv.h>
>>>>>>>>> +
>>>>>>>>>  static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>>>>  {
>>>>>>>>>  	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>>>>>>> @@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>>>>>  	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>>>>>>>>> +	if (drm_dev_is_unplugged(adev->ddev)) {
>>>>>>>>> +		DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
>>>>>>>>> +			 s_job->sched->name);
>>>>>>>>> +		return;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>>  	if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>>>>>>>>>  		DRM_ERROR("ring %s timeout, but soft recovered\n",
>>>>>>>>>  			  s_job->sched->name);
>>>>>>>>> --
>>>>>>>>> 2.7.4
>>>>>>>>>
>>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel