On Fri, Mar 18, 2022 at 9:27 AM Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> wrote:
>
> On 2022-03-18 12:20, Rob Clark wrote:
> > On Fri, Mar 18, 2022 at 9:04 AM Andrey Grodzovsky
> > <andrey.grodzovsky@xxxxxxx> wrote:
> >>
> >> On 2022-03-17 16:35, Rob Clark wrote:
> >>> On Thu, Mar 17, 2022 at 12:50 PM Andrey Grodzovsky
> >>> <andrey.grodzovsky@xxxxxxx> wrote:
> >>>> On 2022-03-17 14:25, Rob Clark wrote:
> >>>>> On Thu, Mar 17, 2022 at 11:10 AM Andrey Grodzovsky
> >>>>> <andrey.grodzovsky@xxxxxxx> wrote:
> >>>>>> On 2022-03-17 13:35, Rob Clark wrote:
> >>>>>>> On Thu, Mar 17, 2022 at 9:45 AM Christian König
> >>>>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>>>> Am 17.03.22 um 17:18 schrieb Rob Clark:
> >>>>>>>>> On Thu, Mar 17, 2022 at 9:04 AM Christian König
> >>>>>>>>> <christian.koenig@xxxxxxx> wrote:
> >>>>>>>>>> Am 17.03.22 um 16:10 schrieb Rob Clark:
> >>>>>>>>>>> [SNIP]
> >>>>>>>>>>> userspace frozen != kthread frozen .. that is what this patch is
> >>>>>>>>>>> trying to address, so we aren't racing between shutting down the hw
> >>>>>>>>>>> and the scheduler shoveling more jobs at us.
> >>>>>>>>>>
> >>>>>>>>>> Well, exactly that's the problem.  The scheduler is supposed to shovel
> >>>>>>>>>> more jobs at us until it is empty.
> >>>>>>>>>>
> >>>>>>>>>> Thinking more about it, we will then keep some dma_fence instances
> >>>>>>>>>> unsignaled, and that is an extremely bad idea since it can lead to
> >>>>>>>>>> deadlocks during suspend.
> >>>>>>>>>
> >>>>>>>>> Hmm, perhaps that is true if you need to migrate things out of vram?
> >>>>>>>>> It is at least not a problem when vram is not involved.
> >>>>>>>>
> >>>>>>>> No, it's much wider than that.
> >>>>>>>>
> >>>>>>>> See, what can happen is that the memory management shrinkers want to
> >>>>>>>> wait for a dma_fence during suspend.
> >>>>>>>
> >>>>>>> we don't wait on fences in shrinker, only purging or evicting things
> >>>>>>> that are already ready.  Actually, waiting on fences in the shrinker
> >>>>>>> path sounds like a pretty bad idea.
> >>>>>>>
> >>>>>>>> And if you stop the scheduler they will just wait forever.
> >>>>>>>>
> >>>>>>>> What you need to do instead is to drain the scheduler, e.g. call
> >>>>>>>> drm_sched_entity_flush() with a proper timeout for each entity you
> >>>>>>>> have created.
> >>>>>>>
> >>>>>>> yeah, it would work to drain the scheduler.. I guess that might be the
> >>>>>>> more portable approach as far as a generic solution for suspend.
> >>>>>>>
> >>>>>>> BR,
> >>>>>>> -R
> >>>>>>
> >>>>>> I am not sure how this drains the scheduler?  Suppose we do the waiting
> >>>>>> in drm_sched_entity_flush, what prevents someone from pushing another
> >>>>>> job into the same entity's queue right after that?
> >>>>>> Shouldn't we first disable further pushing of jobs into the entity
> >>>>>> before we wait for sched->job_scheduled?
> >>>>>>
> >>>>> In the system suspend path, userspace processes will have already been
> >>>>> frozen, so there should be no way to push more jobs to the scheduler,
> >>>>> unless they are pushed from the kernel itself.
> >>>>> amdgpu_device_suspend
> >>>>
> >>>> It was my suspicion but I wasn't sure about it.
> >>>>
> >>>>> We don't do that in drm/msm, but maybe you need to move things btwn
> >>>>> vram and system memory?
> >>>>
> >>>> Exactly, that was my main concern - if we use this method we have to use
> >>>> it at a point in the suspend sequence when all in-kernel job submission
> >>>> activity is already suspended.
> >>>>
> >>>>> But even in that case, if the # of jobs you push is bounded I
> >>>>> guess that is ok?
> >>>>
> >>>> Submissions to scheduler entities use an unbounded queue; the bounded
> >>>> part is when you extract the next job from the entity to submit to the
> >>>> HW ring, and it rejects if the submission limit is reached
> >>>> (drm_sched_ready).
> >>>>
> >>>> In general - it looks to me, at least, that what we want here is more of
> >>>> a drain operation than a flush (i.e. we first want to disable any
> >>>> further job submission to the entity's queue and then flush all
> >>>> in-flight ones).  As an example for this I was looking at
> >>>> flush_workqueue vs. drain_workqueue.
> >>>
> >>> Would it be possible for amdgpu to, in the system suspend task,
> >>>
> >>> 1) first queue up all the jobs needed to migrate bos out of vram, and
> >>> whatever other housekeeping jobs are needed
> >>> 2) then drain gpu scheduler's queues
> >>> 3) and then finally wait for jobs executing on GPU to complete
> >>
> >> We already do most of it in amdgpu_device_suspend:
> >> amdgpu_device_ip_suspend_phase1 followed by amdgpu_device_evict_resources
> >> followed by amdgpu_fence_driver_hw_fini is exactly steps 1 + 3.  What we
> >> are missing is step 2).  For this step I suggest adding a function called
> >> drm_sched_entity_drain which basically sets entity->stopped = true and
> >> then calls drm_sched_entity_flush.  This will both reject any new
> >> insertions into the entity's job queue and will flush all pending job
> >> submissions to HW from that entity.  One point is we need to make
> >> drm_sched_entity_push_job return a value so the caller knows about job
> >> enqueue rejection.
> >
> > Hmm, seems like a job enqueue that is rejected because we are in the
> > process of suspending should be more of a WARN_ON() sort of thing?
> > Not sure if there is something sensible to do for the caller at that
> > point?
>
> What about the job's fence the caller is waiting on?  If we rejected the
> job submission, the caller must know about it so as to not get stuck
> waiting on that fence.
>

Hmm, perhaps I'm not being imaginative enough, but this sort of
scenario seems like it should only arise from a bug in the driver's
suspend path, ie. not doing all the job submission before shutting
down the scheduler.  I don't think anything good is going to result
either way, which is why I was thinking you'd want a WARN_ON() to help
debug/fix that case.
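Fwiw, the drain helper you describe plus that WARN_ON() could look roughly
like the below.. just a sketch to illustrate the idea, not even compile
tested, and drm_sched_entity_drain is a made-up name:

#include <drm/gpu_scheduler.h>

/* Stop accepting new jobs on the entity, then wait (up to 'timeout')
 * for whatever is already queued to be handed off to the hw ring:
 */
long drm_sched_entity_drain(struct drm_sched_entity *entity, long timeout)
{
	spin_lock(&entity->rq_lock);
	entity->stopped = true;
	spin_unlock(&entity->rq_lock);

	return drm_sched_entity_flush(entity, timeout);
}

And then drm_sched_entity_push_job() could grow a WARN_ON() (or return an
error) for the case where something still tries to push to a stopped
entity after this point.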
> >
> >> What about runtime suspend?  I guess the same issue with the scheduler
> >> racing against HW suspend is relevant there?
> >
> > Runtime suspend should be ok, as long as the driver holds a runpm
> > reference whenever the hw needs to be awake.  The problem with system
> > suspend (at least if you are using pm_runtime_force_suspend() or doing
> > something equivalent) is that it bypasses the runpm reference.
> > (Which, IMO, seems like a bad design..)
>
> I am not totally clear yet - can you expand a bit on why one case is ok
> but the other is problematic?
>

Sure, normally pm_runtime_get/put increment a reference count; as long
as there have been more gets than puts, the device won't runtime
suspend.  So, for example, msm's run_job fxn does a
pm_runtime_get_sync().  And retire_submit(), which runs after the job
completes on the GPU, does a pm_runtime_put_autosuspend().

System suspend, OTOH, bypasses this reference counting.  Which is why
extra care is needed.
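In (pseudo)code the pattern is roughly this.. heavily simplified, not
msm's actual code, and the function names are just for illustration:

#include <linux/pm_runtime.h>

/* scheduler's run_job callback: the hw must stay awake from here until
 * the job retires, so take a runpm reference:
 */
static void example_run_job(struct device *gpudev)
{
	pm_runtime_get_sync(gpudev);

	/* ... write the job into the ringbuffer and kick the hw ... */
}

/* called once the job's hw fence has signaled: drop the reference, and
 * once the refcount reaches zero the autosuspend timeout can put the
 * hw back to sleep:
 */
static void example_retire_job(struct device *gpudev)
{
	pm_runtime_mark_last_busy(gpudev);
	pm_runtime_put_autosuspend(gpudev);
}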
BR,
-R

> Andrey
>
> >
> >> Also, could you point to a particular buggy scenario where the race
> >> between the SW scheduler and suspend is causing a problem?
> >
> > I wrote a piglit test[1] to try to trigger this scenario.. it isn't
> > really that easy to hit
> >
> > BR,
> > -R
> >
> > [1] https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/643
> >
> >> Andrey
> >>
> >>> BR,
> >>> -R
> >>>
> >>>> Andrey
> >>>>
> >>>>> BR,
> >>>>> -R