On Thu, Mar 17, 2022 at 03:06:18AM -0700, Christian König wrote: > Am 17.03.22 um 10:59 schrieb Daniel Vetter: > > On Thu, Mar 10, 2022 at 03:46:05PM -0800, Rob Clark wrote: > >> From: Rob Clark <robdclark@xxxxxxxxxxxx> > >> > >> In the system suspend path, we don't want to be racing with the > >> scheduler kthreads pushing additional queued up jobs to the hw > >> queue (ringbuffer). So park them first. While we are at it, > >> move the wait for active jobs to complete into the new system- > >> suspend path. > >> > >> Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx> > >> --- > >> drivers/gpu/drm/msm/adreno/adreno_device.c | 68 ++++++++++++++++++++-- > >> 1 file changed, 64 insertions(+), 4 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c > >> index 8859834b51b8..0440a98988fc 100644 > >> --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > >> +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > >> @@ -619,22 +619,82 @@ static int active_submits(struct msm_gpu *gpu) > >> static int adreno_runtime_suspend(struct device *dev) > >> { > >> struct msm_gpu *gpu = dev_to_gpu(dev); > >> - int remaining; > >> + > >> + /* > >> + * We should be holding a runpm ref, which will prevent > >> + * runtime suspend. In the system suspend path, we've > >> + * already waited for active jobs to complete. > >> + */ > >> + WARN_ON_ONCE(gpu->active_submits); > >> + > >> + return gpu->funcs->pm_suspend(gpu); > >> +} > >> + > >> +static void suspend_scheduler(struct msm_gpu *gpu) > >> +{ > >> + int i; > >> + > >> + /* > >> + * Shut down the scheduler before we force suspend, so that > >> + * suspend isn't racing with scheduler kthread feeding us > >> + * more work. > >> + * > >> + * Note, we just want to park the thread, and let any jobs > >> + * that are already on the hw queue complete normally, as > >> + * opposed to the drm_sched_stop() path used for handling > >> + * faulting/timed-out jobs. We can't really cancel any jobs > >> + * already on the hw queue without racing with the GPU. > >> + */ > >> + for (i = 0; i < gpu->nr_rings; i++) { > >> + struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched; > >> + kthread_park(sched->thread); > > Shouldn't we have some proper interfaces for this? > > If I'm not completely mistaken we already should have one, yes. > > > Also I'm kinda wondering how other drivers do this, feels like we should have a standard > > way. > > > > Finally not flushing out all in-flight requests sounds a bit like a bad > > idea for system suspend/resume since that's also the hibernation path, and > > that would mean your shrinker/page reclaim stops working. At least in full > > generality. Which ain't good for hibernation. > > Completely agree, that looks like an incorrect workaround to me. > > During suspend all userspace applications should be frozen and all f > their hardware activity flushed out and waited for completion. > Isn't that what Rob is doing? He kills the scheduler preventing any new job from being submitted then waits for an outstanding jobs to complete naturally complete (see the wait_event_timeout below). If the jobs don't naturally complete the suspend seems to be aborted? That flow makes sense to me and seems like a novel way to avoid races. Matt > I do remember that our internal guys came up with pretty much the same > idea and it sounded broken to me back then as well. > > Regards, > Christian. > > > > > Adding Christian and Andrey. > > -Daniel > > > >> + } > >> +} > >> + > >> +static void resume_scheduler(struct msm_gpu *gpu) > >> +{ > >> + int i; > >> + > >> + for (i = 0; i < gpu->nr_rings; i++) { > >> + struct drm_gpu_scheduler *sched = &gpu->rb[i]->sched; > >> + kthread_unpark(sched->thread); > >> + } > >> +} > >> + > >> +static int adreno_system_suspend(struct device *dev) > >> +{ > >> + struct msm_gpu *gpu = dev_to_gpu(dev); > >> + int remaining, ret; > >> + > >> + suspend_scheduler(gpu); > >> > >> remaining = wait_event_timeout(gpu->retire_event, > >> active_submits(gpu) == 0, > >> msecs_to_jiffies(1000)); > >> if (remaining == 0) { > >> dev_err(dev, "Timeout waiting for GPU to suspend\n"); > >> - return -EBUSY; > >> + ret = -EBUSY; > >> + goto out; > >> } > >> > >> - return gpu->funcs->pm_suspend(gpu); > >> + ret = pm_runtime_force_suspend(dev); > >> +out: > >> + if (ret) > >> + resume_scheduler(gpu); > >> + > >> + return ret; > >> } > >> + > >> +static int adreno_system_resume(struct device *dev) > >> +{ > >> + resume_scheduler(dev_to_gpu(dev)); > >> + return pm_runtime_force_resume(dev); > >> +} > >> + > >> #endif > >> > >> static const struct dev_pm_ops adreno_pm_ops = { > >> - SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend, pm_runtime_force_resume) > >> + SET_SYSTEM_SLEEP_PM_OPS(adreno_system_suspend, adreno_system_resume) > >> SET_RUNTIME_PM_OPS(adreno_runtime_suspend, adreno_runtime_resume, NULL) > >> }; > >> > >> -- > >> 2.35.1 > >> >