On Tue, Apr 13, 2021 at 9:10 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
> >>> [SNIP]
> >>>>>
> >>>>> So what's the right approach? How do we guarantee that when
> >>>>> running amdgpu_fence_driver_force_completion we signal all the
> >>>>> HW fences without racing against more fences being inserted
> >>>>> into that array?
> >>>>>
> >>>>
> >>>> Well, I would still say the best approach would be to insert this
> >>>> between the front end and the back end and not rely on signaling
> >>>> fences while holding the device SRCU.
> >>>
> >>>
> >>> My question is: even now, when we run
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
> >>> what prevents a race with another fence being emitted and
> >>> inserted into the fence array at the same time? Looks like
> >>> nothing.
> >>>
> >>
> >> Each ring can only be used by one thread at a time, and this
> >> includes emitting fences as well as other stuff.
> >>
> >> During GPU reset we make sure nobody writes to the rings by
> >> stopping the scheduler and taking the GPU reset lock (so that
> >> nobody else can start the scheduler again).
> >
> >
> > What about direct submissions that bypass the scheduler -
> > amdgpu_job_submit_direct? I don't see how those are protected.
>
> Those only happen during startup and GPU reset.
>
> >>
> >>>>
> >>>> BTW: Could it be that the device SRCU protects more than one
> >>>> device and we deadlock because of this?
> >>>
> >>>
> >>> I haven't actually experienced any deadlock until now but, yes,
> >>> drm_unplug_srcu is defined as static in drm_drv.c, so in the
> >>> presence of multiple devices from the same or different drivers
> >>> we are in fact dependent on all of their critical sections, I
> >>> guess.
> >>>
> >>
> >> Shit, yeah, the devil is a squirrel. So for A+I laptops we
> >> actually need to sync that up with Daniel and the rest of the
> >> i915 guys.
> >>
> >> IIRC we could actually have an amdgpu device in a docking station
> >> which needs hotplug, and the driver might end up depending on a
> >> wait in the i915 driver as well.
> >
> >
> > Can't we propose a patch to make drm_unplug_srcu per drm_device? I
> > don't see why it has to be global rather than a per-device thing.
>
> I've really been wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why drm_unplug_srcu is global.

SRCU isn't exactly the cheapest thing, but aside from that we could
make it per-device. I'm not seeing much point, though, since if you
do end up stuck on an ioctl this can happen with anything really.

Also note that dma_fence waits are supposed to be time bound, so you
shouldn't end up waiting on them forever. It should all get sorted
out in due time by TDR, I hope (e.g. if i915 is stuck on a fence
because you're unlucky).
-Daniel
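
For reference, the SRCU-protected pattern under discussion is the
drm_dev_enter()/drm_dev_exit() pair from drm_drv.h. A minimal sketch
(my_hw_access() is a made-up placeholder, not actual driver code):

    #include <drm/drm_drv.h>

    /* Sketch: guard a hardware-touching path with the unplug SRCU.
     * drm_dev_enter() fails once drm_dev_unplug() has been called,
     * so past that point this path cannot race device removal. */
    static int my_hw_access(struct drm_device *dev) /* hypothetical */
    {
            int idx;

            if (!drm_dev_enter(dev, &idx))
                    return -ENODEV; /* device already unplugged */

            /* ... MMIO access, fence emission, etc. (time bound) ... */

            drm_dev_exit(idx);
            return 0;
    }

Because drm_unplug_srcu is currently a single static SRCU in
drm_drv.c, the synchronize_srcu() inside drm_dev_unplug() has to wait
for the critical sections of every DRM device in the system, which is
what the per-device suggestion above would avoid.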
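
On the time-bound point, a sketch of what a bounded fence wait
typically looks like (wait_fence_bounded() and the 100 ms timeout are
arbitrary illustrations, not anyone's actual code):

    #include <linux/dma-fence.h>
    #include <linux/errno.h>
    #include <linux/jiffies.h>

    /* Sketch: wait on a fence with a timeout instead of forever.
     * dma_fence_wait_timeout() returns remaining jiffies (> 0) on
     * success, 0 on timeout and a negative error code otherwise. */
    static int wait_fence_bounded(struct dma_fence *fence) /* hypothetical */
    {
            long r = dma_fence_wait_timeout(fence, true,
                                            msecs_to_jiffies(100));

            if (r == 0)
                    return -ETIMEDOUT; /* let TDR sort it out */
            return r < 0 ? r : 0;
    }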
>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >>
> >> Christian.
> >>
> >>> Andrey
> >>>
> >>>>
> >>>> Christian.
> >>>>
> >>>>> Andrey
> >>>>>
> >>>>>>
> >>>>>>> Andrey
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>>> /* Past this point no more fences are submitted to the HW
> >>>>>>>>>  * ring, and hence we can safely force-signal all that are
> >>>>>>>>>  * currently there.
> >>>>>>>>>  * Any subsequently created HW fences will be returned
> >>>>>>>>>  * signaled with an error code right away.
> >>>>>>>>>  */
> >>>>>>>>>
> >>>>>>>>> for_each_ring(adev)
> >>>>>>>>>     amdgpu_fence_process(ring)
> >>>>>>>>>
> >>>>>>>>> drm_dev_unplug(dev);
> >>>>>>>>> Stop schedulers
> >>>>>>>>> cancel_sync(all timers and queued works);
> >>>>>>>>> hw_fini
> >>>>>>>>> unmap_mmio
> >>>>>>>>>
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Andrey
> >>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alternatively, grabbing the reset write side and
> >>>>>>>>>>>>>> stopping and then restarting the scheduler could work
> >>>>>>>>>>>>>> as well.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Christian.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I didn't get the above, and I don't see why I need to
> >>>>>>>>>>>>> reuse the GPU reset rw_lock. I rely on the SRCU unplug
> >>>>>>>>>>>>> flag for unplug. Also, it's not clear to me why we are
> >>>>>>>>>>>>> focusing on the scheduler threads; any code path that
> >>>>>>>>>>>>> generates HW fences should be covered, so anything
> >>>>>>>>>>>>> leading to amdgpu_fence_emit needs to be taken into
> >>>>>>>>>>>>> account, such as direct IB submissions, VM flushes, etc.
> >>>>>>>>>>>>
> >>>>>>>>>>>> You need to work together with the reset lock anyway,
> >>>>>>>>>>>> because a hotplug could run at the same time as a reset.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Going my way, I do now see that I have to take the reset
> >>>>>>>>>>> write side lock while signaling HW fences, in order to
> >>>>>>>>>>> protect against scheduler/HW fence detachment and
> >>>>>>>>>>> reattachment during scheduler stop/restart. But if we go
> >>>>>>>>>>> with your approach, then calling drm_dev_unplug and
> >>>>>>>>>>> scoping amdgpu_job_timeout with drm_dev_enter/exit should
> >>>>>>>>>>> be enough to prevent any concurrent GPU resets during
> >>>>>>>>>>> unplug. In fact I already do it anyway -
> >>>>>>>>>>> https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
> >>>>>>>>>>
> >>>>>>>>>> Yes, good point as well.
> >>>>>>>>>>
> >>>>>>>>>> Christian.
> >>>>>>>>>>
> >>>>>>>>>>> Andrey
> >>>>>>>>>>>
> >>>>>>>>>>>> Christian.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Andrey
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Christian.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Andrey
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Andrey

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
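
To make the quoted teardown ordering concrete, a sketch in amdgpu
terms (amdgpu_pci_remove_sketch() is illustrative only; the real
driver splits this work across several fini functions):

    #include <drm/drm_drv.h>
    #include "amdgpu.h"

    /* Sketch of the proposed unplug ordering, not actual driver code. */
    static void amdgpu_pci_remove_sketch(struct drm_device *dev) /* hypothetical */
    {
            struct amdgpu_device *adev = drm_to_adev(dev);
            int i;

            /* Past this point no more fences are submitted to the HW
             * rings, so everything in flight can be force-signaled;
             * any HW fence created afterwards comes back already
             * signaled with an error code. */
            for (i = 0; i < AMDGPU_MAX_RINGS; i++)
                    if (adev->rings[i])
                            amdgpu_fence_process(adev->rings[i]);

            drm_dev_unplug(dev);    /* new drm_dev_enter() calls now fail */

            /* stop the GPU schedulers */
            /* cancel_sync(all timers and queued works) */
            /* hw_fini */
            /* unmap MMIO */
    }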
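
And "scoping amdgpu_job_timeout with drm_dev_enter/exit" from above,
sketched as a scheduler timeout handler (my_job_timedout() and its
body are illustrative; the drm_gpu_sched_stat return codes follow the
drm_sched API of this period):

    #include <drm/drm_drv.h>
    #include <drm/gpu_scheduler.h>
    #include "amdgpu.h"

    /* Sketch: skip GPU recovery entirely once the device is gone. */
    static enum drm_gpu_sched_stat my_job_timedout(struct drm_sched_job *s_job) /* hypothetical */
    {
            struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
            int idx;

            if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
                    return DRM_GPU_SCHED_STAT_ENODEV; /* unplugged */

            /* ... usual TDR path: find the guilty job, reset the GPU ... */

            drm_dev_exit(idx);
            return DRM_GPU_SCHED_STAT_NOMINAL;
    }

This way a hotplug that has already passed drm_dev_unplug() cannot
trigger a concurrent GPU reset from the timeout worker, which is the
interaction the two mails above are negotiating.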