Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Daniel Vetter <daniel.vetter@xxxxxxxx> · Mon, 2 Sep 2024 13:00:44 +0200



On Tue, Aug 27, 2024 at 02:24:27PM -0400, Alex Deucher wrote:
> On Tue, Aug 27, 2024 at 12:58 PM Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
> >
> > [machine died, new one working now, I can read complicated mails again an
> > answer.]
> >
> > On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König wrote:
> > > Am 22.08.24 um 11:23 schrieb Daniel Vetter:
> > > > On Wed, Aug 21, 2024 at 10:14:34AM +0200, Christian König wrote:
> > > > > Am 20.08.24 um 18:00 schrieb Thomas Hellström:
> > > > > > > Or why exactly should shrinking fail?
> > > > > > A common example would be not having runtime pm and the particular bo
> > > > > > needs it to unbind, we want to try the next bo. Example: i915 GGTT
> > > > > > bound bos and Lunar Lake PL_TT bos.
> > > > > WHAT? So you basically block shrinking BOs because you can't unbind them
> > > > > because the device is powered down?
> > > > Yes. amdgpu does the same btw :-)
> > >
> > > Well, amdgpu wouldn't block shrinking as far as I know.
> > >
> > > When the GPU is powered down all fences are signaled and things like GART
> > > unbinding is just postponed until the GPU wakes up again.
> > >
> > > > It's a fairly fundamental issue of rpm on discrete gpus, or anything that
> > > > looks a lot like a discrete gpu. The deadlock scenario is roughly:
> > > >
> > > > - In runtime suspend you need to copy any bo out of vram into system ram
> > > >    before you power the gpu. This requires bo and ttm locks.
> > > >
> > > > - You can't just avoid this by holding an rpm reference as long as any bo
> > > >    is still in vram, because that defacto means you'll never autosuspend at
> > > >    runtime. Plus most real hw is complex enough that you have some driver
> > > >    objects that you need to throw out or recreate, so in practice no way to
> > > >    avoid all this.
> > > >
> > > > - runtime resume tends to be easier and mostly doable without taking bo
> > > >    and ttm locks, because by design you know no one else can possibly have
> > > >    any need to get at the gpu hw - it was all powered off after all. It's
> > > >    still messy, but doable.
> > > >
> > > > - Unfortunately this doesn't help, because your runtime resume might need
> > > >    to wait for a in-progress suspend operation to complete. Which means you
> > > >    still deadlock even if your resume path has entirely reasonable locking.
> > >
> > > Uff, yeah that totally confirms my gut feeling that this looked really
> > > questionable.
> > >
> > > > On integrated you can mostly avoid this all because there's no need to
> > > > swap out bo to system memory, they're there already. Exceptions like the
> > > > busted coherency stuff on LNL aside.
> > > >
> > > > But on discrete it's just suck.
> > >
> > > Mhm, I never worked with pure integrated devices but at least radeon, amdgpu
> > > and nouveau seems to have a solution which would work as far as I can tell.
> >
> > Yeah integrated is easy (if you don't do silly stuff like intel on their
> > latest).
> >
> > > Basically on suspend you make sure that everything necessary for shrinking
> > > is done, e.g. waiting for fences, evacuating VRAM etc...
> >
> > This is the hard part, or the "evacuating VRAM" part at least. My proposal
> > is that essentially the runtime_suspend_prepare hook does that. Imo the
> > clean design is that we wait for rpm autosuspend, and only then start to
> > evacuate VRAM. Otherwise you need to hold a rpm reference if there's
> > anything in VRAM, which kinda defeats the point of runtime pm ...
> >
> > Or you have your own runtime pm refcount that tracks whether anyone uses
> > the gpu, and if that drops to 0 for long enough you evacuate VRAM, and
> > then drop the rpm reference. Which is just implementing the idea I've
> > typed up, except hand-rolled in driver code and with a rpm refcount that's
> > entirely driver internal.
> >
> > And at least last time I checked, amdgpu does the VRAM evacuation in the
> > runtime pm suspend callback, which is why all your runtime_pm_get calls
> > are outside of the big bo/ttm locks.
> 
> It used to do the eviction in the runtime_suspend callback, but we ran
> into problems where eviction would sometimes fail if there was too
> much memory pressure during suspend.  Moving it to prepare made things
> fail earlier.  Ultimately, I think the problem is that runtime pm
> disables swap during runtime pm. See:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2362#note_2111666

That's system suspend/resume, where you have a prepare callback. At least
as far as I know, for runtime pm this doesn't exist.
-Sima

> 
> Alex
> 
> >
> > > Hardware specific updates like GART while the device is suspended are
> > > postponed until resume.
> >
> > Yeah GART shouldn't be the issue, except when you're racing.
> >
> > > > TTM discrete gpu drivers avoided all that by simply not having a shrinker
> > > > where you need to runtime pm get, instead all runtime pm gets are outmost,
> > > > without holding any ttm or bo locks.
> > >
> > > Yes, exactly that.
> > >
> > > > > I would say that this is a serious NO-GO. It basically means that powered
> > > > > down devices can lock down system memory for undefined amount of time.
> > > > >
> > > > > In other words an application can allocate memory, map it into GGTT and then
> > > > > suspend or even get killed and we are not able to recover the memory because
> > > > > there is no activity on the GPU any more?
> > > > >
> > > > > That really sounds like a bug in the driver design to me.
> > > > It's a bug in the runtime pm core imo. I think interim what Thomas laid
> > > > out is the best solution, since in practice when the gpu is off you really
> > > > shouldn't need to wake it up. Except when you're unlucky and racing a
> > > > runtime suspend against a shrinker activity (like runtime suspend throws a
> > > > bo into system memory, and the shrinker then immediately wants to swap it
> > > > out).
> > >
> > > Mhm, why exactly is that problematic?
> > >
> > > Wouldn't pm_runtime_get_if_in_use() just return 0 in this situation and we
> > > postpone any hw activity?
> >
> > So when you're runtime suspend, you need to evacuate VRAM. Which means
> > potentially a lot needs to be moved into system memory. Which means it's
> > likely the shrinker gets active. Also, it's the point where
> > pm_runtime_get_if_in_use() will consistently fail, so right when you need
> > the shrinker to be reliable it will fail the most.
> >
> > > > I've been pondering this mess for a few months, and I think I have a
> > > > solution. But it's a lot of work in core pm code unfortunately:
> > > >
> > > > I think we need to split the runtime_suspend callback into two halfes:
> > > >
> > > > ->runtime_suspend_prepare
> > > >
> > > > This would be run by the rpm core code from a worker without holding any
> > > > locks at all. Also, any runtime_pm_get call will not wait on this prepare
> > > > callback to finish, so it's up to the driver to make sure all the locking
> > > > is there. Synchronous suspend calls obviously have to wait for this to
> > > > finish, but that should only happen during system suspend or driver
> > > > unload, where we don't have these deadlock issues.
> > > >
> > > > Drivers can use this callback for any non-destructive prep work
> > > > (non-destructive aside from the copy engine time wasted if it fails) like
> > > > swapping bo from vram to system memory. Drivers must not actually shut
> > > > down the hardware because a runtime_pm_get call must succeed without
> > > > waiting for this callback to finish.
> > > >
> > > > If any runtime_pm_get call happens while the suspend attempt will be
> > > > aborted without further action.
> > > >
> > > > ->runtime_suspend
> > > >
> > > > This does the actual hw power-off. The power core must guarantee that the
> > > > ->runtime_suspend_prepare has successfully completed at least once without
> > > > the rpm refcount being elevated from 0 to 1 again.
> > > >
> > > > This way drivers can assume that all bo have been swapped out from vram
> > > > already, and there's no need to acquire bo or ttm locks in the suspend
> > > > path that could block the resume path.
> > > >
> > > > Which would then allow unconditional runtime_pm_get in the shrinker paths.
> > > >
> > > > Unfortunately this will be all really tricky to implement and I think
> > > > needs to be done in the rumtime pm core.
> > >
> > > Completely agree that this is complicated, but I still don't see the need
> > > for it.
> > >
> > > Drivers just need to use pm_runtime_get_if_in_use() inside the shrinker and
> > > postpone all hw activity until resume.
> >
> > Not good enough, at least long term I think. Also postponing hw activity
> > to resume doesn't solve the deadlock issue, if you still need to grab ttm
> > locks on resume.
> >
> > > At least for amdgpu that shouldn't be a problem at all.
> >
> > Yeah, because atm amdgpu has the rpm_get calls outside of ttm locks, which
> > is exactly the inversion you've complained about as being broken driver
> > design.
> > -Sima
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch