On Tue, Aug 27, 2024 at 12:58 PM Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
>
> [machine died, new one working now, I can read complicated mails again and answer.]
>
> On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König wrote:
> > Am 22.08.24 um 11:23 schrieb Daniel Vetter:
> > > On Wed, Aug 21, 2024 at 10:14:34AM +0200, Christian König wrote:
> > > > Am 20.08.24 um 18:00 schrieb Thomas Hellström:
> > > > > > Or why exactly should shrinking fail?
> > > > > A common example would be not having runtime pm and the particular bo needs it to unbind, we want to try the next bo. Example: i915 GGTT bound bos and Lunar Lake PL_TT bos.
> > > > WHAT? So you basically block shrinking BOs because you can't unbind them because the device is powered down?
> > > Yes. amdgpu does the same btw :-)
> >
> > Well, amdgpu wouldn't block shrinking as far as I know.
> >
> > When the GPU is powered down all fences are signaled and things like GART unbinding are just postponed until the GPU wakes up again.
> >
> > > It's a fairly fundamental issue of rpm on discrete gpus, or anything that looks a lot like a discrete gpu. The deadlock scenario is roughly:
> > >
> > > - In runtime suspend you need to copy any bo out of vram into system ram before you power off the gpu. This requires bo and ttm locks.
> > >
> > > - You can't just avoid this by holding an rpm reference as long as any bo is still in vram, because that de facto means you'll never autosuspend at runtime. Plus most real hw is complex enough that you have some driver objects that you need to throw out or recreate, so in practice there's no way to avoid all this.
> > >
> > > - runtime resume tends to be easier and mostly doable without taking bo and ttm locks, because by design you know no one else can possibly have any need to get at the gpu hw - it was all powered off after all. It's still messy, but doable.
> > >
> > > - Unfortunately this doesn't help, because your runtime resume might need to wait for an in-progress suspend operation to complete. Which means you still deadlock even if your resume path has entirely reasonable locking.
> >
> > Uff, yeah that totally confirms my gut feeling that this looked really questionable.
> >
> > > On integrated you can mostly avoid all this because there's no need to swap out bo to system memory, they're there already. Exceptions like the busted coherency stuff on LNL aside.
> > >
> > > But on discrete it just sucks.
> >
> > Mhm, I never worked with pure integrated devices but at least radeon, amdgpu and nouveau seem to have a solution which would work as far as I can tell.
>
> Yeah integrated is easy (if you don't do silly stuff like intel on their latest).
>
> > Basically on suspend you make sure that everything necessary for shrinking is done, e.g. waiting for fences, evacuating VRAM etc...
>
> This is the hard part, or the "evacuating VRAM" part at least. My proposal is that essentially the runtime_suspend_prepare hook does that. Imo the clean design is that we wait for rpm autosuspend, and only then start to evacuate VRAM. Otherwise you need to hold a rpm reference if there's anything in VRAM, which kinda defeats the point of runtime pm ...
>
> Or you have your own runtime pm refcount that tracks whether anyone uses the gpu, and if that drops to 0 for long enough you evacuate VRAM, and then drop the rpm reference. Which is just implementing the idea I've typed up, except hand-rolled in driver code and with a rpm refcount that's entirely driver internal.
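For illustration, that hand-rolled variant might look something like the sketch below. To be clear this is just a sketch, not what any driver actually ships: all the foo_* names, the 5 second idle delay and the init/error handling elisions are made up, and a real implementation would need more care around the races.

#include <linux/device.h>
#include <linux/jiffies.h>
#include <linux/mutex.h>
#include <linux/pm_runtime.h>
#include <linux/workqueue.h>

struct foo_device {
	struct device *dev;
	struct mutex pm_lock;		/* protects hw_usage, rpm_ref_held */
	unsigned int hw_usage;		/* driver-internal "rpm" refcount */
	bool rpm_ref_held;		/* do we hold the real rpm reference? */
	struct delayed_work idle_work;	/* init with foo_idle_work, elided */
};

int foo_evict_vram(struct foo_device *fdev);	/* hypothetical helper */

static void foo_hw_get(struct foo_device *fdev)
{
	mutex_lock(&fdev->pm_lock);
	if (fdev->hw_usage++ == 0) {
		cancel_delayed_work(&fdev->idle_work);
		if (!fdev->rpm_ref_held) {
			pm_runtime_get_sync(fdev->dev);
			fdev->rpm_ref_held = true;
		}
	}
	mutex_unlock(&fdev->pm_lock);
}

static void foo_hw_put(struct foo_device *fdev)
{
	mutex_lock(&fdev->pm_lock);
	if (--fdev->hw_usage == 0)
		schedule_delayed_work(&fdev->idle_work, 5 * HZ);
	mutex_unlock(&fdev->pm_lock);
}

static void foo_idle_work(struct work_struct *work)
{
	struct foo_device *fdev =
		container_of(work, struct foo_device, idle_work.work);

	/*
	 * No bo/ttm locks are held here and nobody is using the hw, so
	 * the eviction can take those locks without deadlocking against
	 * a concurrent runtime resume.
	 */
	if (foo_evict_vram(fdev))
		return;	/* raced with a new user; rescheduled on next put */

	mutex_lock(&fdev->pm_lock);
	if (fdev->hw_usage == 0 && fdev->rpm_ref_held) {
		fdev->rpm_ref_held = false;
		/* only now can the hw actually power off */
		pm_runtime_put_autosuspend(fdev->dev);
	}
	mutex_unlock(&fdev->pm_lock);
}
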
> And at least last time I checked, amdgpu does the VRAM evacuation in the runtime pm suspend callback, which is why all your runtime_pm_get calls are outside of the big bo/ttm locks.

It used to do the eviction in the runtime_suspend callback, but we ran into problems where eviction would sometimes fail if there was too much memory pressure during suspend. Moving it to prepare made things fail earlier. Ultimately, I think the problem is that swap is disabled during runtime pm. See:
https://gitlab.freedesktop.org/drm/amd/-/issues/2362#note_2111666

Alex

> > Hardware specific updates like GART changes while the device is suspended are postponed until resume.
>
> Yeah GART shouldn't be the issue, except when you're racing.
>
> > > TTM discrete gpu drivers avoided all that by simply not having a shrinker where you need to runtime pm get, instead all runtime pm gets are outermost, without holding any ttm or bo locks.
> >
> > Yes, exactly that.
> >
> > > > I would say that this is a serious NO-GO. It basically means that powered down devices can lock down system memory for an undefined amount of time.
> > > >
> > > > In other words an application can allocate memory, map it into GGTT and then suspend or even get killed and we are not able to recover the memory because there is no activity on the GPU any more?
> > > >
> > > > That really sounds like a bug in the driver design to me.
> > > It's a bug in the runtime pm core imo. I think interim what Thomas laid out is the best solution, since in practice when the gpu is off you really shouldn't need to wake it up. Except when you're unlucky and racing a runtime suspend against shrinker activity (like runtime suspend throws a bo into system memory, and the shrinker then immediately wants to swap it out).
> >
> > Mhm, why exactly is that problematic?
> >
> > Wouldn't pm_runtime_get_if_in_use() just return 0 in this situation and we postpone any hw activity?
>
> So when you runtime suspend, you need to evacuate VRAM. Which means potentially a lot needs to be moved into system memory. Which means it's likely the shrinker gets active. Also, it's the point where pm_runtime_get_if_in_use() will consistently fail, so right when you need the shrinker to be reliable it will fail the most.
>
> > > I've been pondering this mess for a few months, and I think I have a solution. But it's a lot of work in core pm code unfortunately:
> > >
> > > I think we need to split the runtime_suspend callback into two halves:
> > >
> > > ->runtime_suspend_prepare
> > >
> > > This would be run by the rpm core code from a worker without holding any locks at all. Also, any runtime_pm_get call will not wait on this prepare callback to finish, so it's up to the driver to make sure all the locking is there. Synchronous suspend calls obviously have to wait for this to finish, but that should only happen during system suspend or driver unload, where we don't have these deadlock issues.
> > >
> > > Drivers can use this callback for any non-destructive prep work (non-destructive aside from the copy engine time wasted if it fails) like swapping bo from vram to system memory. Drivers must not actually shut down the hardware, because a runtime_pm_get call must succeed without waiting for this callback to finish.
> > >
> > > If any runtime_pm_get call happens meanwhile, the suspend attempt will be aborted without further action.
> > >
> > > ->runtime_suspend
> > >
> > > This does the actual hw power-off. The pm core must guarantee that ->runtime_suspend_prepare has successfully completed at least once without the rpm refcount being elevated from 0 to 1 again.
> > >
> > > This way drivers can assume that all bo have been swapped out from vram already, and there's no need to acquire bo or ttm locks in the suspend path that could block the resume path.
> > >
> > > Which would then allow unconditional runtime_pm_get in the shrinker paths.
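If I understand the proposal right, the driver side could end up looking roughly like below. Purely hypothetical of course: there is no ->runtime_suspend_prepare in dev_pm_ops today, and the foo_* helpers are invented.

#include <linux/device.h>
#include <linux/pm_runtime.h>

struct foo_device;				/* driver private, elided */

int foo_evict_vram(struct foo_device *fdev);	/* hypothetical helpers */
int foo_power_off_hw(struct foo_device *fdev);

/*
 * Hypothetical new callback: run by the rpm core from a worker without
 * holding any rpm-internal locks, and runtime_pm_get does NOT wait for
 * it. Must be non-destructive: the hw stays fully operational, we only
 * move bos out of vram. Taking bo/ttm locks here is fine, because a
 * concurrent runtime_pm_get simply aborts this suspend attempt instead
 * of waiting on us.
 */
static int foo_runtime_suspend_prepare(struct device *dev)
{
	struct foo_device *fdev = dev_get_drvdata(dev);

	return foo_evict_vram(fdev);
}

/*
 * Existing callback, but with the new contract: the core guarantees
 * that ->runtime_suspend_prepare completed successfully without the
 * rpm refcount going 0 -> 1 in between, so vram is already empty and
 * no bo/ttm locks are needed to power off. That is what would make an
 * unconditional runtime_pm_get in the shrinker safe.
 */
static int foo_runtime_suspend(struct device *dev)
{
	struct foo_device *fdev = dev_get_drvdata(dev);

	return foo_power_off_hw(fdev);
}
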
> > > Unfortunately this will all be really tricky to implement, and I think it needs to be done in the runtime pm core.
> >
> > Completely agree that this is complicated, but I still don't see the need for it.
> >
> > Drivers just need to use pm_runtime_get_if_in_use() inside the shrinker and postpone all hw activity until resume.
>
> Not good enough, at least long term I think. Also postponing hw activity to resume doesn't solve the deadlock issue, if you still need to grab ttm locks on resume.
>
> > At least for amdgpu that shouldn't be a problem at all.
>
> Yeah, because atm amdgpu has the rpm_get calls outside of ttm locks, which is exactly the inversion you've complained about as being broken driver design.
> -Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
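P.S.: For reference, the interim pm_runtime_get_if_in_use() pattern being debated above would look roughly like this in a shrinker. Sketch only: the foo_* names are invented, and shrinker->private_data assumes the shrinker_alloc() style API. As Sima points out, this sidesteps rather than solves the problem when the device is suspended, since any unbinding still has to wait for resume.

#include <linux/pm_runtime.h>
#include <linux/shrinker.h>

struct foo_device {
	struct device *dev;
};

/* hypothetical helper; skips hw work (e.g. GART unbind) when !hw_awake */
unsigned long foo_shrink_bos(struct foo_device *fdev,
			     unsigned long nr_to_scan, bool hw_awake);

static unsigned long foo_shrinker_scan(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	struct foo_device *fdev = shrink->private_data;
	unsigned long freed;
	bool hw_awake;

	/*
	 * Never wake the hw from the shrinker; only piggy-back on it
	 * already being awake. If it's suspended, any unbinding is
	 * postponed until runtime resume.
	 */
	hw_awake = pm_runtime_get_if_in_use(fdev->dev) > 0;

	freed = foo_shrink_bos(fdev, sc->nr_to_scan, hw_awake);

	if (hw_awake)
		pm_runtime_put_autosuspend(fdev->dev);

	return freed ?: SHRINK_STOP;
}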