[machine died, new one working now, I can read complicated mails again and
answer.]

On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König wrote:
> Am 22.08.24 um 11:23 schrieb Daniel Vetter:
> > On Wed, Aug 21, 2024 at 10:14:34AM +0200, Christian König wrote:
> > > Am 20.08.24 um 18:00 schrieb Thomas Hellström:
> > > > > Or why exactly should shrinking fail?
> > > > A common example would be not having runtime pm and the particular
> > > > bo needs it to unbind, we want to try the next bo. Example: i915
> > > > GGTT bound bos and Lunar Lake PL_TT bos.
> > > WHAT? So you basically block shrinking BOs because you can't unbind
> > > them because the device is powered down?
> > Yes. amdgpu does the same btw :-)
>
> Well, amdgpu wouldn't block shrinking as far as I know.
>
> When the GPU is powered down all fences are signaled and things like
> GART unbinding are just postponed until the GPU wakes up again.
>
> > It's a fairly fundamental issue of rpm on discrete gpus, or anything
> > that looks a lot like a discrete gpu. The deadlock scenario is roughly:
> >
> > - In runtime suspend you need to copy any bo out of vram into system
> >   ram before you power off the gpu. This requires bo and ttm locks.
> >
> > - You can't just avoid this by holding an rpm reference as long as any
> >   bo is still in vram, because that de facto means you'll never
> >   autosuspend at runtime. Plus most real hw is complex enough that you
> >   have some driver objects that you need to throw out or recreate, so
> >   in practice there's no way to avoid all this.
> >
> > - Runtime resume tends to be easier and mostly doable without taking
> >   bo and ttm locks, because by design you know no one else can possibly
> >   have any need to get at the gpu hw - it was all powered off after
> >   all. It's still messy, but doable.
> >
> > - Unfortunately this doesn't help, because your runtime resume might
> >   need to wait for an in-progress suspend operation to complete. Which
> >   means you still deadlock even if your resume path has entirely
> >   reasonable locking.
>
> Uff, yeah that totally confirms my gut feeling that this looked really
> questionable.
>
> > On integrated you can mostly avoid all this because there's no need to
> > swap out bo to system memory, they're there already. Exceptions like
> > the busted coherency stuff on LNL aside.
> >
> > But on discrete it just sucks.
>
> Mhm, I never worked with pure integrated devices, but at least radeon,
> amdgpu and nouveau seem to have a solution which would work as far as I
> can tell.

Yeah integrated is easy (if you don't do silly stuff like intel on their
latest).

> Basically on suspend you make sure that everything necessary for
> shrinking is done, e.g. waiting for fences, evacuating VRAM etc...

This is the hard part, or the "evacuating VRAM" part at least. My proposal
is that essentially the runtime_suspend_prepare hook does that.

Imo the clean design is that we wait for rpm autosuspend, and only then
start to evacuate VRAM. Otherwise you need to hold an rpm reference as
long as there's anything in VRAM, which kinda defeats the point of runtime
pm ...

Or you have your own runtime pm refcount that tracks whether anyone uses
the gpu, and if that drops to 0 for long enough you evacuate VRAM, and
then drop the rpm reference. Which is just implementing the idea I've
typed up, except hand-rolled in driver code and with a rpm refcount that's
entirely driver internal.
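To illustrate, here's a minimal sketch of what such a hand-rolled scheme
could look like. All the drv_* names and DRV_IDLE_TIMEOUT are made up for
this example, and it glosses over a bunch of the races that make this
nasty in practice:

#include <linux/mutex.h>
#include <linux/pm_runtime.h>
#include <linux/workqueue.h>

#define DRV_IDLE_TIMEOUT msecs_to_jiffies(1000)

struct drv_device {
	struct device *dev;
	struct mutex idle_lock;		/* protects use_count and rpm_held */
	int use_count;			/* driver-internal "gpu in use" count */
	bool rpm_held;			/* long-lived rpm reference held? */
	struct delayed_work idle_work;	/* INIT_DELAYED_WORK() at probe */
};

int drv_evacuate_vram(struct drv_device *drv);	/* made up: copies all bos out */

static void drv_gpu_get(struct drv_device *drv)
{
	mutex_lock(&drv->idle_lock);
	cancel_delayed_work(&drv->idle_work);
	if (drv->use_count++ == 0 && !drv->rpm_held) {
		pm_runtime_get_sync(drv->dev);
		drv->rpm_held = true;
	}
	mutex_unlock(&drv->idle_lock);
}

static void drv_gpu_put(struct drv_device *drv)
{
	mutex_lock(&drv->idle_lock);
	/* idle: evacuate VRAM after a grace period, not immediately */
	if (--drv->use_count == 0)
		schedule_delayed_work(&drv->idle_work, DRV_IDLE_TIMEOUT);
	mutex_unlock(&drv->idle_lock);
}

static void drv_idle_work_fn(struct work_struct *work)
{
	struct drv_device *drv = container_of(to_delayed_work(work),
					      struct drv_device, idle_work);

	/* takes bo/ttm locks, but we still hold the rpm reference here, so
	 * runtime resume can never be waiting on us and deadlock */
	if (drv_evacuate_vram(drv)) {
		/* couldn't evacuate, keep the rpm reference and retry */
		schedule_delayed_work(&drv->idle_work, DRV_IDLE_TIMEOUT);
		return;
	}

	mutex_lock(&drv->idle_lock);
	if (drv->use_count == 0 && drv->rpm_held) {
		drv->rpm_held = false;
		/* only now can the hw actually autosuspend */
		pm_runtime_put_autosuspend(drv->dev);
	}
	mutex_unlock(&drv->idle_lock);
}

The point being that the VRAM evacuation always runs with the rpm
reference still held, so none of the bo/ttm locking ever ends up inside
the rpm core's suspend/resume serialization.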
And at least last time I checked, amdgpu does the VRAM evacuation in the
runtime pm suspend callback, which is why all your runtime_pm_get calls
are outside of the big bo/ttm locks.

> Hardware specific updates like GART while the device is suspended are
> postponed until resume.

Yeah GART shouldn't be the issue, except when you're racing.

> > TTM discrete gpu drivers avoided all that by simply not having a
> > shrinker where you need to runtime pm get, instead all runtime pm gets
> > are outermost, without holding any ttm or bo locks.
>
> Yes, exactly that.
>
> > > I would say that this is a serious NO-GO. It basically means that
> > > powered down devices can lock down system memory for an undefined
> > > amount of time.
> > >
> > > In other words an application can allocate memory, map it into GGTT
> > > and then suspend or even get killed, and we are not able to recover
> > > the memory because there is no activity on the GPU any more?
> > >
> > > That really sounds like a bug in the driver design to me.
> > It's a bug in the runtime pm core imo. I think in the interim what
> > Thomas laid out is the best solution, since in practice when the gpu
> > is off you really shouldn't need to wake it up. Except when you're
> > unlucky and racing a runtime suspend against shrinker activity (like
> > runtime suspend throws a bo into system memory, and the shrinker then
> > immediately wants to swap it out).
>
> Mhm, why exactly is that problematic?
>
> Wouldn't pm_runtime_get_if_in_use() just return 0 in this situation and
> we postpone any hw activity?

So when you runtime suspend, you need to evacuate VRAM. Which means
potentially a lot needs to be moved into system memory. Which means it's
likely the shrinker gets active.

Also, it's the point where pm_runtime_get_if_in_use() will consistently
fail, so right when you need the shrinker to be reliable it will fail the
most.

> > I've been pondering this mess for a few months, and I think I have a
> > solution. But it's a lot of work in core pm code unfortunately:
> >
> > I think we need to split the runtime_suspend callback into two halves:
> >
> > ->runtime_suspend_prepare
> >
> > This would be run by the rpm core code from a worker without holding
> > any locks at all. Also, any runtime_pm_get call will not wait on this
> > prepare callback to finish, so it's up to the driver to make sure all
> > the locking is there. Synchronous suspend calls obviously have to wait
> > for this to finish, but that should only happen during system suspend
> > or driver unload, where we don't have these deadlock issues.
> >
> > Drivers can use this callback for any non-destructive prep work
> > (non-destructive aside from the copy engine time wasted if it fails)
> > like swapping bo from vram to system memory. Drivers must not actually
> > shut down the hardware, because a runtime_pm_get call must succeed
> > without waiting for this callback to finish.
> >
> > If any runtime_pm_get call happens meanwhile, the suspend attempt will
> > be aborted without further action.
> >
> > ->runtime_suspend
> >
> > This does the actual hw power-off. The pm core must guarantee that
> > ->runtime_suspend_prepare has successfully completed at least once
> > without the rpm refcount being elevated from 0 to 1 again.
> >
> > This way drivers can assume that all bo have been swapped out from
> > vram already, and there's no need to acquire bo or ttm locks in the
> > suspend path that could block the resume path.
> >
> > Which would then allow unconditional runtime_pm_get in the shrinker
> > paths.
> >
> > Unfortunately this will be all really tricky to implement, and I think
> > it needs to be done in the runtime pm core.
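To make the shape of this concrete, the driver side of that split could
look roughly like the sketch below. ->runtime_suspend_prepare does not
exist in today's struct dev_pm_ops - that is exactly the part the rpm
core would need to grow - and the drv_* helpers are the same made-up
placeholders as in the sketch further up:

#include <linux/pm_runtime.h>

int drv_evacuate_vram(struct drv_device *drv);	/* made up, as above */
int drv_power_off_hw(struct drv_device *drv);	/* made up */
int drv_runtime_resume(struct device *dev);	/* made up */

/* Phase 1: run by the rpm core from a worker, without blocking any
 * concurrent pm_runtime_get() calls. Only non-destructive prep work; at
 * worst some copy engine time is wasted if the gpu gets woken up again
 * before phase 2. Taking bo/ttm locks here is fine because no resume
 * path can be blocked waiting for this to finish. */
static int drv_runtime_suspend_prepare(struct device *dev)
{
	struct drv_device *drv = dev_get_drvdata(dev);

	return drv_evacuate_vram(drv);
}

/* Phase 2: only called once phase 1 has completed with the rpm refcount
 * staying at 0 throughout. VRAM is guaranteed to be empty already, so no
 * bo or ttm locks are needed and resume can never deadlock against it. */
static int drv_runtime_suspend(struct device *dev)
{
	struct drv_device *drv = dev_get_drvdata(dev);

	return drv_power_off_hw(drv);
}

static const struct dev_pm_ops drv_pm_ops = {
	/* the proposed new callback, not in today's dev_pm_ops */
	.runtime_suspend_prepare = drv_runtime_suspend_prepare,
	.runtime_suspend = drv_runtime_suspend,
	.runtime_resume = drv_runtime_resume,
};

With that guarantee the shrinker could take an unconditional rpm
reference before unbinding anything, because the only suspend work that
ever runs under bo/ttm locks is the abortable prepare phase.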
> Completely agree that this is complicated, but I still don't see the
> need for it.
>
> Drivers just need to use pm_runtime_get_if_in_use() inside the shrinker
> and postpone all hw activity until resume.

Not good enough, at least not long term I think. Also, postponing hw
activity to resume doesn't solve the deadlock issue if you still need to
grab ttm locks on resume.

> At least for amdgpu that shouldn't be a problem at all.

Yeah, because atm amdgpu has the rpm_get calls outside of ttm locks,
which is exactly the inversion you've complained about as being broken
driver design.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch