Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Christian König <christian.koenig@xxxxxxx> · Thu, 22 Aug 2024 15:19:29 +0200

    Am 22.08.24 um 11:23 schrieb Daniel Vetter:

      On Wed, Aug 21, 2024 at 10:14:34AM +0200, Christian König wrote:

        Am 20.08.24 um 18:00 schrieb Thomas Hellström:

            Or why exactly should shrinking fail?

          A common example would be not having runtime pm and the particular bo
needs it to unbind, we want to try the next bo. Example: i915 GGTT
bound bos and Lunar Lake PL_TT bos.

        WHAT? So you basically block shrinking BOs because you can't unbind them
because the device is powered down?

      Yes. amdgpu does the same btw :-)

    Well, amdgpu wouldn't block shrinking as far as I know.

    When the GPU is powered down all fences are signaled and things like
    GART unbinding is just postponed until the GPU wakes up again.

      It's a fairly fundamental issue of rpm on discrete gpus, or anything that
looks a lot like a discrete gpu. The deadlock scenario is roughly:

- In runtime suspend you need to copy any bo out of vram into system ram
  before you power the gpu. This requires bo and ttm locks.

- You can't just avoid this by holding an rpm reference as long as any bo
  is still in vram, because that defacto means you'll never autosuspend at
  runtime. Plus most real hw is complex enough that you have some driver
  objects that you need to throw out or recreate, so in practice no way to
  avoid all this.

- runtime resume tends to be easier and mostly doable without taking bo
  and ttm locks, because by design you know no one else can possibly have
  any need to get at the gpu hw - it was all powered off after all. It's
  still messy, but doable.

- Unfortunately this doesn't help, because your runtime resume might need
  to wait for a in-progress suspend operation to complete. Which means you
  still deadlock even if your resume path has entirely reasonable locking.

    Uff, yeah that totally confirms my gut feeling that this looked
    really questionable.

      On integrated you can mostly avoid this all because there's no need to
swap out bo to system memory, they're there already. Exceptions like the
busted coherency stuff on LNL aside.

But on discrete it's just suck.

    Mhm, I never worked with pure integrated devices but at least
    radeon, amdgpu and nouveau seems to have a solution which would work
    as far as I can tell.

    Basically on suspend you make sure that everything necessary for
    shrinking is done, e.g. waiting for fences, evacuating VRAM etc...

    Hardware specific updates like GART while the device is suspended
    are postponed until resume.

      TTM discrete gpu drivers avoided all that by simply not having a shrinker
where you need to runtime pm get, instead all runtime pm gets are outmost,
without holding any ttm or bo locks.

    Yes, exactly that.

        I would say that this is a serious NO-GO. It basically means that powered
down devices can lock down system memory for undefined amount of time.

In other words an application can allocate memory, map it into GGTT and then
suspend or even get killed and we are not able to recover the memory because
there is no activity on the GPU any more?

That really sounds like a bug in the driver design to me.

      It's a bug in the runtime pm core imo. I think interim what Thomas laid
out is the best solution, since in practice when the gpu is off you really
shouldn't need to wake it up. Except when you're unlucky and racing a
runtime suspend against a shrinker activity (like runtime suspend throws a
bo into system memory, and the shrinker then immediately wants to swap it
out).

    Mhm, why exactly is that problematic?

    Wouldn't pm_runtime_get_if_in_use() just return 0 in this situation
    and we postpone any hw activity?

      I've been pondering this mess for a few months, and I think I have a
solution. But it's a lot of work in core pm code unfortunately:

I think we need to split the runtime_suspend callback into two halfes:

->runtime_suspend_prepare

This would be run by the rpm core code from a worker without holding any
locks at all. Also, any runtime_pm_get call will not wait on this prepare
callback to finish, so it's up to the driver to make sure all the locking
is there. Synchronous suspend calls obviously have to wait for this to
finish, but that should only happen during system suspend or driver
unload, where we don't have these deadlock issues.

Drivers can use this callback for any non-destructive prep work
(non-destructive aside from the copy engine time wasted if it fails) like
swapping bo from vram to system memory. Drivers must not actually shut
down the hardware because a runtime_pm_get call must succeed without
waiting for this callback to finish.

If any runtime_pm_get call happens while the suspend attempt will be
aborted without further action.

->runtime_suspend

This does the actual hw power-off. The power core must guarantee that the
->runtime_suspend_prepare has successfully completed at least once without
the rpm refcount being elevated from 0 to 1 again.

This way drivers can assume that all bo have been swapped out from vram
already, and there's no need to acquire bo or ttm locks in the suspend
path that could block the resume path.

Which would then allow unconditional runtime_pm_get in the shrinker paths.

Unfortunately this will be all really tricky to implement and I think
needs to be done in the rumtime pm core.

    Completely agree that this is complicated, but I still don't see the
    need for it.

    Drivers just need to use pm_runtime_get_if_in_use() inside the
    shrinker and postpone all hw activity until resume.

    At least for amdgpu that shouldn't be a problem at all.

    Regards,

    Christian.

Cheers, Sima