Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Daniel Vetter <daniel.vetter@xxxxxxxx> · Thu, 22 Aug 2024 11:23:47 +0200

On Wed, Aug 21, 2024 at 10:14:34AM +0200, Christian König wrote:
> Am 20.08.24 um 18:00 schrieb Thomas Hellström:
> > > Or why exactly should shrinking fail?
> > A common example would be not having runtime pm and the particular bo
> > needs it to unbind, we want to try the next bo. Example: i915 GGTT
> > bound bos and Lunar Lake PL_TT bos.
> 
> WHAT? So you basically block shrinking BOs because you can't unbind them
> because the device is powered down?

Yes. amdgpu does the same btw :-)

It's a fairly fundamental issue of rpm on discrete gpus, or anything that
looks a lot like a discrete gpu. The deadlock scenario is roughly:

- In runtime suspend you need to copy any bo out of vram into system ram
  before you power the gpu. This requires bo and ttm locks.

- You can't just avoid this by holding an rpm reference as long as any bo
  is still in vram, because that defacto means you'll never autosuspend at
  runtime. Plus most real hw is complex enough that you have some driver
  objects that you need to throw out or recreate, so in practice no way to
  avoid all this.

- runtime resume tends to be easier and mostly doable without taking bo
  and ttm locks, because by design you know no one else can possibly have
  any need to get at the gpu hw - it was all powered off after all. It's
  still messy, but doable.

- Unfortunately this doesn't help, because your runtime resume might need
  to wait for a in-progress suspend operation to complete. Which means you
  still deadlock even if your resume path has entirely reasonable locking.

On integrated you can mostly avoid this all because there's no need to
swap out bo to system memory, they're there already. Exceptions like the
busted coherency stuff on LNL aside.

But on discrete it's just suck.

TTM discrete gpu drivers avoided all that by simply not having a shrinker
where you need to runtime pm get, instead all runtime pm gets are outmost,
without holding any ttm or bo locks.

> I would say that this is a serious NO-GO. It basically means that powered
> down devices can lock down system memory for undefined amount of time.
> 
> In other words an application can allocate memory, map it into GGTT and then
> suspend or even get killed and we are not able to recover the memory because
> there is no activity on the GPU any more?
> 
> That really sounds like a bug in the driver design to me.

It's a bug in the runtime pm core imo. I think interim what Thomas laid
out is the best solution, since in practice when the gpu is off you really
shouldn't need to wake it up. Except when you're unlucky and racing a
runtime suspend against a shrinker activity (like runtime suspend throws a
bo into system memory, and the shrinker then immediately wants to swap it
out).

I've been pondering this mess for a few months, and I think I have a
solution. But it's a lot of work in core pm code unfortunately:

I think we need to split the runtime_suspend callback into two halfes:

->runtime_suspend_prepare

This would be run by the rpm core code from a worker without holding any
locks at all. Also, any runtime_pm_get call will not wait on this prepare
callback to finish, so it's up to the driver to make sure all the locking
is there. Synchronous suspend calls obviously have to wait for this to
finish, but that should only happen during system suspend or driver
unload, where we don't have these deadlock issues.

Drivers can use this callback for any non-destructive prep work
(non-destructive aside from the copy engine time wasted if it fails) like
swapping bo from vram to system memory. Drivers must not actually shut
down the hardware because a runtime_pm_get call must succeed without
waiting for this callback to finish.

If any runtime_pm_get call happens while the suspend attempt will be
aborted without further action.

->runtime_suspend

This does the actual hw power-off. The power core must guarantee that the
->runtime_suspend_prepare has successfully completed at least once without
the rpm refcount being elevated from 0 to 1 again.

This way drivers can assume that all bo have been swapped out from vram
already, and there's no need to acquire bo or ttm locks in the suspend
path that could block the resume path.

Which would then allow unconditional runtime_pm_get in the shrinker paths.

Unfortunately this will be all really tricky to implement and I think
needs to be done in the rumtime pm core.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch