On Thu, 2024-08-22 at 11:29 +0200, Christian König wrote: > Am 22.08.24 um 10:21 schrieb Thomas Hellström: > > On Thu, 2024-08-22 at 09:55 +0200, Christian König wrote: > > > Am 22.08.24 um 08:47 schrieb Thomas Hellström: > > > > > > > As Sima said, this is complicated but not beyond > > > > > > > comprehension: > > > > > > > i915 > > > > > > > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317 > > > > > > As far as I can tell what i915 does here is extremely > > > > > > questionable. > > > > > > > > > > > > if (sc->nr_scanned < sc->nr_to_scan && > > > > > > current_is_kswapd()) { > > > > > > .... > > > > > > with_intel_runtime_pm(&i915->runtime_pm, > > > > > > wakeref) { > > > > > > > > > > > > with_intel_runtime_pm() then calls pm_runtime_get_sync(). > > > > > > > > > > > > So basically the i915 shrinker assumes that when called > > > > > > from > > > > > > kswapd() > > > > > > that it can synchronously wait for runtime PM to power up > > > > > > the > > > > > > device > > > > > > again. > > > > > > > > > > > > As far as I can tell that means that a device driver makes > > > > > > strong > > > > > > and > > > > > > completely undocumented assumptions how kswapd works > > > > > > internally. > > > > > Admittedly that looks weird > > > > > > > > > > But I'd really expect a reclaim lockdep splat to happen there > > > > > if > > > > > the > > > > > i915 pm did something not-allowed. IIRC, the design direction > > > > > the > > > > > i915 > > > > > people got from mm people regarding the shrinkers was to > > > > > avoid > > > > > any > > > > > sleeps in direct reclaim and punt it to kswapd. Need to ask > > > > > i915 > > > > > people > > > > > how they can get away with that. > > > > > > > > > > > > > > So it turns out that Xe integrated pm resume is reclaim-safe, > > > > and > > > > I'd > > > > expect i915's to be as well. Xe discrete pm resume isn't. > > > > > > > > So that means that, at least for integrated, the i915 shrinker > > > > should > > > > be ok from that POW, and punting certain bos to kswapd is not > > > > AFAICT > > > > abusing any undocumented features of kswapd but rather a way to > > > > avoid > > > > resuming the device during direct reclaim, like documented. > > > The more I think about this the more I disagree to this driver > > > design. > > > In my opinion device drivers should *never* resume runtime PM in > > > a > > > shrinker callback in the first place. > > Runtime PM resume is allowed even from irq context if carefully > > implemented by the driver and flagged as such to the core. > > > > https://docs.kernel.org/power/runtime_pm.html > > > > Resuming runtime PM from reclaim therefore shouldn't be an issue > > IMO, > > and really up to the driver. > > Mhm when it's up to the driver on which level to use runtime PM then > that at least explains why the framework doesn't have lockdep > annotations. > > Ok, that is at least convincing the what i915 does here should work > somehow. > > > > When the device is turned off it means that all of it's > > > operations > > > are > > > stopped and eventually power to caches etc turned off as well. So > > > I > > > don't see any ongoing writeback operations or similar either. > > > > > > So the question is why do we need to power on the device in a > > > shrinker > > > in the first place? > > > > > > What could be is that the device needs to flush GART TLBs or > > > similar > > > when it is turned on, e.g. that you grab a PM reference to make > > > sure > > > that during your HW operation the device doesn't suspend. > > Exactly why the i915 needs to flush the GART I'm not sure of but I > > suspect the gart TLB might be forgotten while suspended. > > Well that is unproblematic. Amdgpu and I think nouveau does something > similar. > > But you don't need to resume the hardware for this, just grabbing the > reference to make sure that it doesn't suspend is sufficient. > > The assumption I make here is that you don't need to do anything when > the hardware is power down anyway. That seems to be true for at least > the hardware designs I'm aware of. I'm not sure I understand you correctly but do you suggest that each bo with a GGTT mapping should hold a pm reference and we refuse to suspend while there are GGTT mappings? > > > > But that doesn't mean that you should resume the device. In other > > > words > > > when the device is powered down you shouldn't power it up again. > > > > > > And for GART we already have the necessary move callback > > > implemented > > > in > > > TTM. This is done by radeon, amdgpu and nouveu in a common way as > > > far > > > as > > > I can see. > > > > > > So why should Xe be special and follow the very questionable > > > approach > > > of > > > i915 here? > > For Xe, Lunar Lake (integrated) has the interesting design that > > each bo > > carries compression metadata that needs to be blitted to system > > pages > > during shrinking. The alternative is to resolve all buffer objects > > at > > device runtime suspend... > > That's the same for amdgpu as well, but when the device is powered > down > those compression data needs to be evacuated anyway. That's true for Intel discrete when entering D3Cold, which is typically not done if that means a huge amount of data to evict. Integrated doesn't suspend to D3Cold but using D3Hot and the compression metadata is persistent, but is only accessible when the device is woken. > > > > > But runtime PM aside, with a one-bo only approach we still have the > > drawbacks that it > > > > * eliminates possibility for driver deadlock avoidance > > * Requires TTM knowledge of "purgeable" bos > > * Requires an additional LRU to avoid O(n2) traversal of already > > shrunken objects > > * Drivers with legitimate shrinker designs that don't fit in the > > TTM- > > enforced model will have frustrated maintainers. > > I still find that only halve-convincing. The real question is if it's > a > good idea to give drivers the power to decide what to shrink and what > not to shrink. > > And at least with the arguments and experience at hand I would vote > for > not doing that. We have added the eviction_valuable callback for > amdgpu > and ended up in quite a mess with that. > > Background is that some eviction decision done by the driver where > not > as optimal as we hoped it to be. > > On the other hand keeping track of all the swapped out objects should > be > TTMs job anyway, e.g. having a TTM_PL_SWAPPED domain. > > So in my mind the ideal solution still looks like this: > > driver_specific_shrinker_scan(...) > { > driver_specific_preparations(...); > bo = ttm_reserve_next_bo_to_shrink(...); > ttm_bo_validate(bo, TTM_PL_SWAPPED); > ttm_bo_unreserver(bo); > driver_specific_cleanups(...); > } > > When there is a potential deadlock because the shrinker might be > called > from driver code which holds locks the driver needs to it's specific > preparation or cleanup then those would apply to all BOs and not just > the one returned from TTM. > > The only use case I can see were the driver would need to filter out > the > BOs to shrink would be if TTM doesn't know about all the information > to > make a decision what to shrink and exactly that is what I try to > avoid. Yeah, but unfortunately the use cases we have require per-bo decisions, so suggested strucure becomes: driver_specific_shrinker_scan(...) { ttm_for_each_suggested_bo_to_shrink(&bo) { if (driver_preparations_and_ok_to_shrink(bo)) { ttm_shrink_this(&bo); driver_cleanups(); if (vmscan_thinks_weve_done_enough()) break; } } Thanks, Thomas > > Regards, > Christian. > > > > > Thanks, > > Thomas > > > > > > > Regards, > > > Christian. > > > > > > > > > > /Thomas > > > > > > > > > > > >