Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Thu, 22 Aug 2024 15:16:44 +0200

On Thu, 2024-08-22 at 11:29 +0200, Christian König wrote:
> Am 22.08.24 um 10:21 schrieb Thomas Hellström:
> > On Thu, 2024-08-22 at 09:55 +0200, Christian König wrote:
> > > Am 22.08.24 um 08:47 schrieb Thomas Hellström:
> > > > > > > As Sima said, this is complicated but not beyond
> > > > > > > comprehension:
> > > > > > > i915
> > > > > > > https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317
> > > > > > As far as I can tell what i915 does here is extremely
> > > > > > questionable.
> > > > > > 
> > > > > >        if (sc->nr_scanned < sc->nr_to_scan &&
> > > > > > current_is_kswapd()) {
> > > > > > ....
> > > > > >            with_intel_runtime_pm(&i915->runtime_pm,
> > > > > > wakeref) {
> > > > > > 
> > > > > > with_intel_runtime_pm() then calls pm_runtime_get_sync().
> > > > > > 
> > > > > > So basically the i915 shrinker assumes that when called
> > > > > > from
> > > > > > kswapd()
> > > > > > that it can synchronously wait for runtime PM to power up
> > > > > > the
> > > > > > device
> > > > > > again.
> > > > > > 
> > > > > > As far as I can tell that means that a device driver makes
> > > > > > strong
> > > > > > and
> > > > > > completely undocumented assumptions how kswapd works
> > > > > > internally.
> > > > > Admittedly that looks weird
> > > > > 
> > > > > But I'd really expect a reclaim lockdep splat to happen there
> > > > > if
> > > > > the
> > > > > i915 pm did something not-allowed. IIRC, the design direction
> > > > > the
> > > > > i915
> > > > > people got from mm people regarding the shrinkers was to
> > > > > avoid
> > > > > any
> > > > > sleeps in direct reclaim and punt it to kswapd. Need to ask
> > > > > i915
> > > > > people
> > > > > how they can get away with that.
> > > > > 
> > > > > 
> > > > So it turns out that Xe integrated pm resume is reclaim-safe,
> > > > and
> > > > I'd
> > > > expect i915's to be as well. Xe discrete pm resume isn't.
> > > > 
> > > > So that means that, at least for integrated, the i915 shrinker
> > > > should
> > > > be ok from that POW, and punting certain bos to kswapd is not
> > > > AFAICT
> > > > abusing any undocumented features of kswapd but rather a way to
> > > > avoid
> > > > resuming the device during direct reclaim, like documented.
> > > The more I think about this the more I disagree to this driver
> > > design.
> > > In my opinion device drivers should *never* resume runtime PM in
> > > a
> > > shrinker callback in the first place.
> > Runtime PM resume is allowed even from irq context if carefully
> > implemented by the driver and flagged as such to the core.
> > 
> > https://docs.kernel.org/power/runtime_pm.html
> > 
> > Resuming runtime PM from reclaim therefore shouldn't be an issue
> > IMO,
> > and really up to the driver.
> 
> Mhm when it's up to the driver on which level to use runtime PM then 
> that at least explains why the framework doesn't have lockdep
> annotations.
> 
> Ok, that is at least convincing the what i915 does here should work
> somehow.
> 
> > > When the device is turned off it means that all of it's
> > > operations
> > > are
> > > stopped and eventually power to caches etc turned off as well. So
> > > I
> > > don't see any ongoing writeback operations or similar either.
> > > 
> > > So the question is why do we need to power on the device in a
> > > shrinker
> > > in the first place?
> > > 
> > > What could be is that the device needs to flush GART TLBs or
> > > similar
> > > when it is turned on, e.g. that you grab a PM reference to make
> > > sure
> > > that during your HW operation the device doesn't suspend.
> > Exactly why the i915 needs to flush the GART I'm not sure of but I
> > suspect the gart TLB might be forgotten while suspended.
> 
> Well that is unproblematic. Amdgpu and I think nouveau does something
> similar.
> 
> But you don't need to resume the hardware for this, just grabbing the
> reference to make sure that it doesn't suspend is sufficient.
> 
> The assumption I make here is that you don't need to do anything when
> the hardware is power down anyway. That seems to be true for at least
> the hardware designs I'm aware of.

I'm not sure I understand you correctly but do you suggest that each bo
with a GGTT mapping should hold a pm reference and we refuse to suspend
while there are GGTT mappings?

> 
> > > But that doesn't mean that you should resume the device. In other
> > > words
> > > when the device is powered down you shouldn't power it up again.
> > > 
> > > And for GART we already have the necessary move callback
> > > implemented
> > > in
> > > TTM. This is done by radeon, amdgpu and nouveu in a common way as
> > > far
> > > as
> > > I can see.
> > > 
> > > So why should Xe be special and follow the very questionable
> > > approach
> > > of
> > > i915 here?
> > For Xe, Lunar Lake (integrated) has the interesting design that
> > each bo
> > carries compression metadata that needs to be blitted to system
> > pages
> > during shrinking. The alternative is to resolve all buffer objects
> > at
> > device runtime suspend...
> 
> That's the same for amdgpu as well, but when the device is powered
> down 
> those compression data needs to be evacuated anyway.

That's true for Intel discrete when entering D3Cold, which is typically
not done if that means a huge amount of data to evict.

Integrated doesn't suspend to D3Cold but using D3Hot and the
compression metadata is persistent, but is only accessible when the
device is woken. 

> 
> 
> 
> > But runtime PM aside, with a one-bo only approach we still have the
> > drawbacks that it
> > 
> > * eliminates possibility for driver deadlock avoidance
> > * Requires TTM knowledge of "purgeable" bos
> > * Requires an additional LRU to avoid O(n2) traversal of already
> > shrunken objects
> > * Drivers with legitimate shrinker designs that don't fit in the
> > TTM-
> > enforced model will have frustrated maintainers.
> 
> I still find that only halve-convincing. The real question is if it's
> a 
> good idea to give drivers the power to decide what to shrink and what
> not to shrink.
> 
> And at least with the arguments and experience at hand I would vote
> for 
> not doing that. We have added the eviction_valuable callback for
> amdgpu 
> and ended up in quite a mess with that.
> 
> Background is that some eviction decision done by the driver where
> not 
> as optimal as we hoped it to be.
> 
> On the other hand keeping track of all the swapped out objects should
> be 
> TTMs job anyway, e.g. having a TTM_PL_SWAPPED domain.
> 
> So in my mind the ideal solution still looks like this:
> 
> driver_specific_shrinker_scan(...)
> {
>      driver_specific_preparations(...);
>      bo = ttm_reserve_next_bo_to_shrink(...);
>      ttm_bo_validate(bo, TTM_PL_SWAPPED);
>      ttm_bo_unreserver(bo);
>      driver_specific_cleanups(...);
> }
> 
> When there is a potential deadlock because the shrinker might be
> called 
> from driver code which holds locks the driver needs to it's specific 
> preparation or cleanup then those would apply to all BOs and not just
> the one returned from TTM.
> 
> The only use case I can see were the driver would need to filter out
> the 
> BOs to shrink would be if TTM doesn't know about all the information
> to 
> make a decision what to shrink and exactly that is what I try to
> avoid.

Yeah, but unfortunately the use cases we have require per-bo decisions,
so suggested strucure becomes:

driver_specific_shrinker_scan(...)
{
     ttm_for_each_suggested_bo_to_shrink(&bo) {
	if (driver_preparations_and_ok_to_shrink(bo)) {
		ttm_shrink_this(&bo);
		driver_cleanups();
		if (vmscan_thinks_weve_done_enough())
			break;
     }
}

Thanks,
Thomas

> 
> Regards,
> Christian.
> 
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > 
> > > > /Thomas
> > > > 
> > > > 
> > > >