On Wed, Aug 21, 2024 at 10:14:34AM +0200, Christian König wrote: > Am 20.08.24 um 18:00 schrieb Thomas Hellström: > > > Or why exactly should shrinking fail? > > A common example would be not having runtime pm and the particular bo > > needs it to unbind, we want to try the next bo. Example: i915 GGTT > > bound bos and Lunar Lake PL_TT bos. > > WHAT? So you basically block shrinking BOs because you can't unbind them > because the device is powered down? Yes. amdgpu does the same btw :-) It's a fairly fundamental issue of rpm on discrete gpus, or anything that looks a lot like a discrete gpu. The deadlock scenario is roughly: - In runtime suspend you need to copy any bo out of vram into system ram before you power the gpu. This requires bo and ttm locks. - You can't just avoid this by holding an rpm reference as long as any bo is still in vram, because that defacto means you'll never autosuspend at runtime. Plus most real hw is complex enough that you have some driver objects that you need to throw out or recreate, so in practice no way to avoid all this. - runtime resume tends to be easier and mostly doable without taking bo and ttm locks, because by design you know no one else can possibly have any need to get at the gpu hw - it was all powered off after all. It's still messy, but doable. - Unfortunately this doesn't help, because your runtime resume might need to wait for a in-progress suspend operation to complete. Which means you still deadlock even if your resume path has entirely reasonable locking. On integrated you can mostly avoid this all because there's no need to swap out bo to system memory, they're there already. Exceptions like the busted coherency stuff on LNL aside. But on discrete it's just suck. TTM discrete gpu drivers avoided all that by simply not having a shrinker where you need to runtime pm get, instead all runtime pm gets are outmost, without holding any ttm or bo locks. > I would say that this is a serious NO-GO. It basically means that powered > down devices can lock down system memory for undefined amount of time. > > In other words an application can allocate memory, map it into GGTT and then > suspend or even get killed and we are not able to recover the memory because > there is no activity on the GPU any more? > > That really sounds like a bug in the driver design to me. It's a bug in the runtime pm core imo. I think interim what Thomas laid out is the best solution, since in practice when the gpu is off you really shouldn't need to wake it up. Except when you're unlucky and racing a runtime suspend against a shrinker activity (like runtime suspend throws a bo into system memory, and the shrinker then immediately wants to swap it out). I've been pondering this mess for a few months, and I think I have a solution. But it's a lot of work in core pm code unfortunately: I think we need to split the runtime_suspend callback into two halfes: ->runtime_suspend_prepare This would be run by the rpm core code from a worker without holding any locks at all. Also, any runtime_pm_get call will not wait on this prepare callback to finish, so it's up to the driver to make sure all the locking is there. Synchronous suspend calls obviously have to wait for this to finish, but that should only happen during system suspend or driver unload, where we don't have these deadlock issues. Drivers can use this callback for any non-destructive prep work (non-destructive aside from the copy engine time wasted if it fails) like swapping bo from vram to system memory. Drivers must not actually shut down the hardware because a runtime_pm_get call must succeed without waiting for this callback to finish. If any runtime_pm_get call happens while the suspend attempt will be aborted without further action. ->runtime_suspend This does the actual hw power-off. The power core must guarantee that the ->runtime_suspend_prepare has successfully completed at least once without the rpm refcount being elevated from 0 to 1 again. This way drivers can assume that all bo have been swapped out from vram already, and there's no need to acquire bo or ttm locks in the suspend path that could block the resume path. Which would then allow unconditional runtime_pm_get in the shrinker paths. Unfortunately this will be all really tricky to implement and I think needs to be done in the rumtime pm core. Cheers, Sima -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch