On Wed, 2024-08-21 at 10:57 +0200, Thomas Hellström wrote: > On Wed, 2024-08-21 at 10:14 +0200, Christian König wrote: > > Am 20.08.24 um 18:00 schrieb Thomas Hellström: > > > > Or why exactly should shrinking fail? > > > A common example would be not having runtime pm and the > > > particular > > > bo > > > needs it to unbind, we want to try the next bo. Example: i915 > > > GGTT > > > bound bos and Lunar Lake PL_TT bos. > > > > WHAT? So you basically block shrinking BOs because you can't unbind > > them > > because the device is powered down? > > > > I would say that this is a serious NO-GO. It basically means that > > powered down devices can lock down system memory for undefined > > amount > > of > > time. > > > > In other words an application can allocate memory, map it into GGTT > > and > > then suspend or even get killed and we are not able to recover the > > memory because there is no activity on the GPU any more? > > > > That really sounds like a bug in the driver design to me. > > It's bad but it's not as bad as it sounds. > > Problem is we can't wake up during direct reclaim IIRC due to runtime > pm lockdep violations, but we can and do fire up a thread to wake the > device and after the wakeup delay have subsequent shrink calls > succeed, > or punt to kswapd or the oom handler. > I think that's an orthogonal discussion, though. There are other > reasons shrinking might fail, like the bo being busy in direct > reclaim > (shouldn't wait for idle there but ok in kswapd), Other points of > failure is ofc shmem radix tree allocations (not seen one yet, > though) > which might succeed with a smaller bo. > (Not saying, though, that there isn't more to be done with the xe > runtime pm implementation). > > > > > > And again, all other drm bo shrinkers do this. We just want to do > > > the > > > same. > > > > Do you have pointers? > > As Sima said, this is complicated but not beyond comprehension: i915 > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317 > > msm: > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317 > which uses > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/drm_gem.c#L1426 > that is very similar in structure to what I implemented for TTM. > > Panfrost: (although only purgeable objects AFAICT). > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/drm_gem.c#L1426 > > > > > > > > > If we bump LRU we could end up with infinite loops. > > > > > So IMO we need to be able to loop. I don't really care wether > > > > > we do > > > > > this as an explicit loop or whether we use the LRU walker, > > > > > but > > > > > I > > > > > think > > > > > from a maintainability point-of-view it is better to keep LRU > > > > > walking > > > > > in a single place. > > > > > > > > > > If we return an unlocked object, we'd need to refcount and > > > > > drop > > > > > the > > > > > lru > > > > > lock, but maybe that's not a bad thing. > > > > > > > > > > But what's the main drawback of exporting the existing > > > > > helper. > > > > Well that we re-creates exactly the mid-layer mess I worked so > > > > hard > > > > to > > > > remove from TTM. > > > It doesn't IMO. I agree the first attempt did. This affects only > > > the > > > LRU iteration itself and I'm even fine to get rid of the callback > > > using > > > a for_ macro. > > > > Well, I mean using a for_each approach is objectively better than > > having > > a callback and a state bag. > > > > But the fundamental question is if drivers are allowed to reject > > shrinking. And I think the answer is no, they need to be designed > > in > > a > > way where shrinking is always possible. > > Rejects can be out of our control, due to anticipated deadlocks, oom > and deferring to kswapd. > > > > > What can be that we can't get the necessary locks to evict and > > object > > (because it's about to be used etc...), but that are the per- > > requisites > > TTM should be checking. > > > > > > > In any case, I don't think TTM should enforce a different way > > > > > of > > > > > shrinking by the means of a severely restricted helper? > > > > Well, as far as I can see that is exactly what TTM should do. > > > > > > > > I mean the main advantage to make a common component is to > > > > enforce > > > > correct behavior. > > > But if all other drivers don't agree this as correct behavior and > > > instead want to keep behavior that is proven to work, that's a > > > dead > > > end. > > > > Well no, even if all drivers agree to (for example) drop security > > precautions it's still not something acceptable. > > > > And same thing here, if we block shrinking because drivers think > > they > > want their runtime PM implemented in a certain way then upstream > > needs > > to block this and push back. > > > > As far as I can see it's mandatory to have shrinkers not depend on > > runtime PM, cause otherwise you run into resources handling which > > depends on the well behavior of userspace and that in turn in > > something > > we can't allow. > > Please see the above explanation for runtime pm, and for the record I > agree that enforcing disallowed or security violations is a > completely > valid thing. Meant to say enforcing disallowing security violations.. Another thing that came to my mind is ofc swap_space. Depending on how backup is done if we're out of swap space we can only shrink purgeable WONTNEED objects (user-space pools typically), and TTM has no knowledge of those (currently at least). /Thomas > > /Thomas > > > > > Regards, > > Christian. > > > > > > > > /Thomas > > > > > > > > > > Regards, > > > > Christian. > > > > > > > > > /Thomas > > > > > >