On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> wrote:
>
> On 2022-03-18 12:20, Rob Clark wrote:
> > On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
> > <andrey.grodzovsky@xxxxxxx> wrote:
> >>
> >> On 2022-03-17 16:35, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> >>> <andrey.grodzovsky@xxxxxxx> wrote:
> >>>> On 2022-03-17 14:25, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>>>> <andrey.grodzovsky@xxxxxxx> wrote:
> >>>>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>>>> [SNIP]
> >>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>>>>
> >>>>>>>>>> Well, exactly that's the problem.  The scheduler is supposed to shovel
> >>>>>>>>>> more jobs at us until it is empty.
> >>>>>>>>>>
> >>>>>>>>>> Thinking more about it, we will then keep some dma_fence instances
> >>>>>>>>>> unsignaled, and that is an extremely bad idea since it can lead to
> >>>>>>>>>> deadlocks during suspend.
> >>>>>>>>>
> >>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>>>> It is at least not a problem when vram is not involved.
> >>>>>>>>
> >>>>>>>> No, it's much wider than that.
> >>>>>>>>
> >>>>>>>> See, what can happen is that the memory management shrinkers want to
> >>>>>>>> wait for a dma_fence during suspend.
> >>>>>>>
> >>>>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>>>> that are already ready.  Actually, waiting on fences in the shrinker
> >>>>>>> path sounds like a pretty bad idea.
> >>>>>>>
> >>>>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>>>
> >>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you
> >>>>>>>> have created.
> >>>>>>>
> >>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>>>> more portable approach as far as a generic solution for suspend.
> >>>>>>>
> >>>>>>> BR,
> >>>>>>> -R
> >>>>>>
> >>>>>> I am not sure how this drains the scheduler?  Suppose we do the waiting
> >>>>>> in drm_sched_entity_flush, what prevents someone from pushing another
> >>>>>> job into the same entity's queue right after that?
> >>>>>> Shouldn't we first disable further pushing of jobs into the entity
> >>>>>> before we wait for sched->job_scheduled?
> >>>>>>
> >>>>> In the system suspend path, userspace processes will have already been
> >>>>> frozen, so there should be no way to push more jobs to the scheduler,
> >>>>> unless they are pushed from the kernel itself.
> >>>>> amdgpu_device_suspend
> >>>>
> >>>> It was my suspicion but I wasn't sure about it.
> >>>>
> >>>>> We don't do that in drm/msm, but maybe you need to move things btwn
> >>>>> vram and system memory?
> >>>>
> >>>> Exactly, that was my main concern - if we use this method we have to use
> >>>> it at a point in the suspend sequence when all in-kernel job submission
> >>>> activity is already suspended.
> >>>>
> >>>>> But even in that case, if the # of jobs you push is bounded I
> >>>>> guess that is ok?
> >>>>
> >>>> Submissions to scheduler entities use an unbounded queue; the bounded
> >>>> part is when you extract the next job from the entity to submit to the
> >>>> HW ring, and it rejects if the submission limit is reached
> >>>> (drm_sched_ready).
> >>>>
> >>>> In general - it looks to me, at least, that what we want here is more of
> >>>> a drain operation than a flush (i.e. we first want to disable any
> >>>> further job submission to the entity's queue and then flush all
> >>>> in-flight ones).  As an example for this I was looking at
> >>>> flush_workqueue vs. drain_workqueue.
> >>>
> >>> Would it be possible for amdgpu to, in the system suspend task,
> >>>
> >>> 1) first queue up all the jobs needed to migrate bos out of vram, and
> >>> whatever other housekeeping jobs are needed
> >>> 2) then drain gpu scheduler's queues
> >>> 3) and then finally wait for jobs executing on GPU to complete
> >>
> >> We already do most of it in amdgpu_device_suspend:
> >> amdgpu_device_ip_suspend_phase1 followed by amdgpu_device_evict_resources
> >> followed by amdgpu_fence_driver_hw_fini is exactly steps 1 + 3.  What we
> >> are missing is step 2).  For this step I suggest adding a function called
> >> drm_sched_entity_drain which basically sets entity->stopped = true and
> >> then calls drm_sched_entity_flush.  This will both reject any new
> >> insertions into the entity's job queue and will flush all pending job
> >> submissions to HW from that entity.  One point is we need to make
> >> drm_sched_entity_push_job return a value so the caller knows about job
> >> enqueue rejection.
> >
> > Hmm, seems like a job enqueue that is rejected because we are in the
> > process of suspending should be more of a WARN_ON() sort of thing?
> > Not sure if there is something sensible to do for the caller at that
> > point?
>
> What about the job's fence the caller is waiting on?  If we rejected the
> job submission, the caller must know about it so as to not get stuck
> waiting on that fence.
>

Hmm, perhaps I'm not being imaginative enough, but this sort of
scenario seems like it should only arise from a bug in the driver's
suspend path, ie. not doing all the job submission before shutting
down the scheduler.  I don't think anything good is going to result
either way, which is why I was thinking you'd want a WARN_ON() to help
debug/fix that case.
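Fwiw, the drain helper you describe plus that WARN_ON() could look roughly
like the below.. just a sketch to illustrate the idea, not even compile
tested, and drm_sched_entity_drain is a made-up name:

#include <drm/gpu_scheduler.h>

/* Stop accepting new jobs on the entity, then wait (up to 'timeout')
 * for whatever is already queued to be handed off to the hw ring:
 */
long drm_sched_entity_drain(struct drm_sched_entity *entity, long timeout)
{
	spin_lock(&entity->rq_lock);
	entity->stopped = true;
	spin_unlock(&entity->rq_lock);

	return drm_sched_entity_flush(entity, timeout);
}

And then drm_sched_entity_push_job() could grow a WARN_ON() (or return an
error) for the case where something still tries to push to a stopped
entity after this point.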
> >
> >> What about runtime suspend?  I guess the same issue with the scheduler
> >> racing against HW suspend is relevant there?
> >
> > Runtime suspend should be ok, as long as the driver holds a runpm
> > reference whenever the hw needs to be awake.  The problem with system
> > suspend (at least if you are using pm_runtime_force_suspend() or doing
> > something equivalent) is that it bypasses the runpm reference.
> > (Which, IMO, seems like a bad design..)
>
> I am not totally clear yet - can you expand a bit on why one case is ok
> but the other is problematic?
>

Sure, normally pm_runtime_get/put increment a reference count; as long
as there have been more gets than puts, the device won't runtime
suspend.  So, for example, msm's run_job fxn does a
pm_runtime_get_sync().  And retire_submit(), which runs after the job
completes on the GPU, does a pm_runtime_put_autosuspend().

System suspend, OTOH, bypasses this reference counting.  Which is why
extra care is needed.
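In (pseudo)code the pattern is roughly this.. heavily simplified, not
msm's actual code, and the function names are just for illustration:

#include <linux/pm_runtime.h>

/* scheduler's run_job callback: the hw must stay awake from here until
 * the job retires, so take a runpm reference:
 */
static void example_run_job(struct device *gpudev)
{
	pm_runtime_get_sync(gpudev);

	/* ... write the job into the ringbuffer and kick the hw ... */
}

/* called once the job's hw fence has signaled: drop the reference, and
 * once the refcount reaches zero the autosuspend timeout can put the
 * hw back to sleep:
 */
static void example_retire_job(struct device *gpudev)
{
	pm_runtime_mark_last_busy(gpudev);
	pm_runtime_put_autosuspend(gpudev);
}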
BR,
-R

> Andrey
>
> >
> >> Also, could you point to a particular buggy scenario where the race
> >> between the SW scheduler and suspend is causing a problem?
> >
> > I wrote a piglit test[1] to try to trigger this scenario.. it isn't
> > really that easy to hit
> >
> > BR,
> > -R
> >
> > [1] https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643
> >
> >> Andrey
> >>
> >>> BR,
> >>> -R
> >>>
> >>>> Andrey
> >>>>
> >>>>> BR,
> >>>>> -R