Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Wed, 09 Oct 2024 15:39:20 +0200

On Mon, 2024-10-07 at 11:08 +0200, Christian König wrote:
> Hi Thomas,
> 
> I'm on sick leave, but I will try to answer questions and share some 
> thoughts on this to unblock you.
> 
> Am 18.09.24 um 14:57 schrieb Thomas Hellström:
> > Sima, Christian
> > 
> > I've updated the shrinker series now with a guarded for_each macro
> > instead:
> > 
> > https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9
> 
> Yeah that looks like a big step in the right direction.
> 
> > (Note I forgot to remove the export of the previous LRU walker).
> > 
> >   so the midlayer argument is now not an issue anymore. The
> > cleanup.h
> > guard provides some additional protection against drivers exiting
> > the
> > LRU loop early.
> > 
> > So remaining is the question whether the driver is allowed to
> > discard a
> > suggested bo to shrink from TTM.
> > 
> > Arguments for:
> > 
> > 1) Not allowing that would require teaching TTM about purgeable
> > objects.
> 
> I think that is actually not correct. TTM already knows about
> purgeable 
> objects.
> 
> E.g. when TTM asks the driver what to do with evicted objects it is 
> perfectly valid to return an empty placement list indicating that the
> backing store can just be thrown away.
> 
> We use this for things like temporary buffers for example.
> 
> That this doesn't apply to swapping looks like a design bug in the 
> driver/TTM interface to me.

Yes we can do that with TTM, but for shrinking there's no point in
trying to shrink when there is no swap-space left, other than purgeable
object. The number of shrinkable objects returned in shrinker::count
needs to reflect that, and *then* only those objects should be selected
for shrinking. If we were to announce that to TTM using a callback,
we're actually back to version 1 of this series which was rejected by
you exactly since it was using callbacks a year or so ago????

> 
> > 2) Devices who need the blitter during shrinking would want to punt
> > runtime_pm_get() to kswapd to avoid sleeping direct reclaim.
> 
> I think the outcome of the discussion is that runtime PM should never
> be 
> mixed into TTM swapping.
> 
> You can power up blocks to enable a HW blitter for swapping, but this
> then can't be driven by the runtime PM framework.

Still that power-on might be sleeping, so what's the difference using
runtime-pm or not? Why should the driver implement yet another power
interface, just because TTM refuses to publish a sane LRU walk
interface?

> 
> > 3) If those devices end up blitting (LNL) to be able to shrink,
> > they
> > would want to punt waiting for the fence to signal to kswapd to
> > avoid
> > waiting in direct reclaim.
> 
> Mhm, what do you mean with that?

When we fired the blitter from direct reclaim, we get a fence. If we
wait for it in direct reclaim we will be sleeping waiting for gpu,
which is bad form. We'd like return a failure so the object is retried
when idle, or from kswapd.

> 
> 
> > 4) It looks like we need to resort to folio_trylock in the shmem
> > backup
> > backend when shrinking is called for gfp_t = GFP_NOFS. A failing
> > trylock will require a new bo.
> 
> Why would a folio trylock succeed with one BO and not another?

Good point. We'd fail anyway but would perhaps need to call
SHRINK_STOP..

> 
> And why would that trylock something the device specific driver
> should 
> handle?

It happens in the TTM shrinker helper called from the driver in the
spirit of demidlayering.

> 
> > Arguments against:
> > None really. I thought the idea of demidlayering would be to allow
> > the
> > driver more freedom.
> 
> Well that is a huge argument against it. Giving drivers more freedom
> is 
> absolutely not something which turned out to be valuable in the past.

So then what's the point of demidlayering?

> 
> Instead we should put device drivers in a very strict corset of 
> validated approaches to do things.
> 
> Background is that in my experience driver developers are perfectly 
> willing to do unclean approaches which only work in like 99% of all 
> cases just to get a bit more performance or simpler driver
> implementation.
> 
> Those approaches are not legal and in my opinion as subsystem
> maintainer 
> I think we need to be more strict and push back much harder on stuff 
> like that.

Still, historically that has made developers abandon common components
for driver-specific solutions. 

And the question is still not answered.

Exactly *why* can't the driver fail and continue traversing the LRU,
because all our argumentation revolves around this and you have yet to
provide an objective reason why. 

/Thomas

> 
> Regards,
> Christian.
> 
> > 
> > So any feedback appreciated. If that is found acceptable we can
> > proceed
> > with reviewing this patch and also with the shrinker series.
> > 
> > Thanks,
> > Thomas
> > 
> > 
> > On Mon, 2024-09-02 at 13:07 +0200, Daniel Vetter wrote:
> > > On Wed, Aug 28, 2024 at 02:20:34PM +0200, Christian König wrote:
> > > > Am 27.08.24 um 19:53 schrieb Daniel Vetter:
> > > > > On Tue, Aug 27, 2024 at 06:52:13PM +0200, Daniel Vetter
> > > > > wrote:
> > > > > > On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König
> > > > > > wrote:
> > > > > > > Completely agree that this is complicated, but I still
> > > > > > > don't
> > > > > > > see the need
> > > > > > > for it.
> > > > > > > 
> > > > > > > Drivers just need to use pm_runtime_get_if_in_use()
> > > > > > > inside
> > > > > > > the shrinker and
> > > > > > > postpone all hw activity until resume.
> > > > > > Not good enough, at least long term I think. Also
> > > > > > postponing hw
> > > > > > activity
> > > > > > to resume doesn't solve the deadlock issue, if you still
> > > > > > need
> > > > > > to grab ttm
> > > > > > locks on resume.
> > > > > Pondered this specific aspect some more, and I think you
> > > > > still
> > > > > have a race
> > > > > here (even if you avoid the deadlock): If the condiditional
> > > > > rpm_get call
> > > > > fails there's no guarantee that the device will
> > > > > suspend/resume
> > > > > and clean
> > > > > up the GART mapping.
> > > > Well I think we have a major disconnect here. When the device
> > > > is
> > > > powered
> > > > down there is no GART mapping to clean up any more.
> > > > 
> > > > In other words GART is a table in local memory (VRAM) when the
> > > > device is
> > > > powered down this table is completely destroyed. Any BO which
> > > > was
> > > > mapped
> > > > inside this table is now not mapped any more.
> > > > 
> > > > So when the shrinker wants to evict a BO which is marked as
> > > > mapped
> > > > to GART
> > > > and the device is powered down we just skip the GART unmapping
> > > > part
> > > > because
> > > > that has already implicitly happened during power down.
> > > > 
> > > > Before mapping any BO into the GART again we power the GPU up
> > > > through the
> > > > runtime PM calls. And while powering it up again the GART is
> > > > restored.
> > > My point is that you can't tell whether the device will power
> > > down or
> > > not,
> > > you can only tell whether there's a chance it might be powering
> > > down
> > > and
> > > so you can't get at the rpm reference without deadlock issues.
> > > 
> > > > > The race gets a bit smaller if you use
> > > > > pm_runtime_get_if_active(), but even then you might catch it
> > > > > right when
> > > > > resume almost finished.
> > > > What race are you talking about?
> > > > 
> > > > The worst thing which could happen is that we restore a GART
> > > > entry
> > > > which
> > > > isn't needed any more, but that is pretty much irrelevant since
> > > > we
> > > > only
> > > > clear them to avoid some hw bugs.
> > > The race I'm seeing is where you thought the GART entry is not
> > > issue,
> > > tossed an object, but the device didn't suspend, so might still
> > > use
> > > it.
> > > 
> > > I guess if we're clearly separating the sw allocation of the
> > > TTM_TT
> > > with
> > > the physical entries in the GART that should all work, but feels
> > > a
> > > bit
> > > tricky. The race I've seen is essentially these two getting out
> > > of
> > > sync.
> > > 
> > > So maybe it was me who's stuck.
> > > 
> > > What I wonder is whether it works in practice, since on the
> > > restore
> > > side
> > > you need to get some locks to figure out which gart mappings
> > > exist
> > > and
> > > need restoring. And that's the same locks as the shrinker needs
> > > to
> > > figure
> > > out whether it might need to reap a gart mapping.
> > > 
> > > Or do you just copy the gart entries over and restore them
> > > exactly
> > > as-is,
> > > so that there's no shared locks?
> > > 
> > > > > That means we'll have ttm bo hanging around with GART
> > > > > allocations/mappings
> > > > > which aren't actually valid anymore (since they might escape
> > > > > the
> > > > > cleanup
> > > > > upon resume due to the race). That doesn't feel like a solid
> > > > > design
> > > > > either.
> > > > I'm most likely missing something, but I'm really scratching my
> > > > head where
> > > > you see a problem here.
> > > I guess one issue is that at least traditionally, igfx drivers
> > > have
> > > nested
> > > runtime pm within dma_resv lock. And dgpu drivers the other way
> > > round.
> > > Which is a bit awkward if you're trying for common code.
> > > 
> > > Cheers, Sima
>