Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Christian König <christian.koenig@xxxxxxx> · Thu, 10 Oct 2024 10:00:31 +0200

    Am 09.10.24 um 16:17 schrieb Thomas Hellström:

      On Wed, 2024-10-09 at 15:39 +0200, Thomas Hellström wrote:

        On Mon, 2024-10-07 at 11:08 +0200, Christian König wrote:

          Hi Thomas,

I'm on sick leave, but I will try to answer questions and share
some 
thoughts on this to unblock you.

Am 18.09.24 um 14:57 schrieb Thomas Hellström:

            Sima, Christian

I've updated the shrinker series now with a guarded for_each
macro
instead:

https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9

          Yeah that looks like a big step in the right direction.

            (Note I forgot to remove the export of the previous LRU walker).

  so the midlayer argument is now not an issue anymore. The
cleanup.h
guard provides some additional protection against drivers exiting
the
LRU loop early.

So remaining is the question whether the driver is allowed to
discard a
suggested bo to shrink from TTM.

Arguments for:

1) Not allowing that would require teaching TTM about purgeable
objects.

          I think that is actually not correct. TTM already knows about
purgeable 
objects.

E.g. when TTM asks the driver what to do with evicted objects it is
perfectly valid to return an empty placement list indicating that
the
backing store can just be thrown away.

We use this for things like temporary buffers for example.

That this doesn't apply to swapping looks like a design bug in the 
driver/TTM interface to me.

        Yes we can do that with TTM, but for shrinking there's no point in
trying to shrink when there is no swap-space left, other than
purgeable
object. The number of shrinkable objects returned in shrinker::count
needs to reflect that, and *then* only those objects should be
selected
for shrinking. If we were to announce that to TTM using a callback,
we're actually back to version 1 of this series which was rejected by
you exactly since it was using callbacks a year or so ago????

    Yeah that indeed doesn't sound like a good idea.

    On the other hand the decision that only purgeable objects should be
    selected when there is no swap space left sounds like something TTM
    should do and not the driver.

            2) Devices who need the blitter during shrinking would want to
punt
runtime_pm_get() to kswapd to avoid sleeping direct reclaim.

          I think the outcome of the discussion is that runtime PM should
never
be 
mixed into TTM swapping.

You can power up blocks to enable a HW blitter for swapping, but
this
then can't be driven by the runtime PM framework.

        Still that power-on might be sleeping, so what's the difference using
runtime-pm or not? Why should the driver implement yet another power
interface, just because TTM refuses to publish a sane LRU walk
interface?

    See the discussion with Sima around the PM functions.

    My understanding might be wrong, but I now think that with local
    memory you can't do the i915 approach where the PM functions are so
    low level that they can also be called inside the shrinker.

    So you basically have PM functions for your whole device and PM
    functions for only the HW blocks necessary for the shrinker.

            3) If those devices end up blitting (LNL) to be able to shrink,
they
would want to punt waiting for the fence to signal to kswapd to
avoid
waiting in direct reclaim.

          Mhm, what do you mean with that?

        When we fired the blitter from direct reclaim, we get a fence. If we
wait for it in direct reclaim we will be sleeping waiting for gpu,
which is bad form. We'd like return a failure so the object is
retried
when idle, or from kswapd.

    Oh, that is a really good point. So basically you want to avoid
    dependencies on the dma_fence.

            4) It looks like we need to resort to folio_trylock in the shmem
backup
backend when shrinking is called for gfp_t = GFP_NOFS. A failing
trylock will require a new bo.

          Why would a folio trylock succeed with one BO and not another?

        Good point. We'd fail anyway but would perhaps need to call
SHRINK_STOP..

          And why would that trylock something the device specific driver
should 
handle?

        It happens in the TTM shrinker helper called from the driver in the
spirit of demidlayering.

            Arguments against:
None really. I thought the idea of demidlayering would be to
allow
the
driver more freedom.

          Well that is a huge argument against it. Giving drivers more
freedom
is 
absolutely not something which turned out to be valuable in the
past.

        So then what's the point of demidlayering?

    So that drivers can prepare the environment for TTM to work with
    instead of TTM asking for it.

    In your case for example that means powering up HW blocks so that
    BOs could be moved.

          Instead we should put device drivers in a very strict corset of 
validated approaches to do things.

Background is that in my experience driver developers are perfectly
willing to do unclean approaches which only work in like 99% of all
cases just to get a bit more performance or simpler driver
implementation.

Those approaches are not legal and in my opinion as subsystem
maintainer 
I think we need to be more strict and push back much harder on
stuff 
like that.

        Still, historically that has made developers abandon common
components
for driver-specific solutions. 

And the question is still not answered.

Exactly *why* can't the driver fail and continue traversing the LRU,
because all our argumentation revolves around this and you have yet
to
provide an objective reason why. 

      And in the end, while I think there definitely things worthy of
discussion in this series, 

I don't think there is a point in trying to land a LNL shrinker
operating against a crippled TTM LRU walk interface, nor do I think
anyone would want to attempt to port i915 over, and reworking it three
times I'm now pretty familiar with what works and what doesn't.

So question becomes, with proper reviews can I merge the series here as
is, *with* the de-midlayered LRU walk and noting your advise against
it, or not?

    More or less yes, I still have a bad feeling about it but we
    probably need to see the whole thing getting used to judge if it
    really works or not.

    I mean it's not UAPI we are talking about, so even if we find in
    5years from now that it was a bad idea we can still roll it back and
    try something else.

    So yeah, feel free to go ahead.

    Regards,

    Christian.

Thanks,
Thomas

        /Thomas

          Regards,
Christian.

            So any feedback appreciated. If that is found acceptable we can
proceed
with reviewing this patch and also with the shrinker series.

Thanks,
Thomas

On Mon, 2024-09-02 at 13:07 +0200, Daniel Vetter wrote:

              On Wed, Aug 28, 2024 at 02:20:34PM +0200, Christian König
wrote:

                Am 27.08.24 um 19:53 schrieb Daniel Vetter:

                  On Tue, Aug 27, 2024 at 06:52:13PM +0200, Daniel Vetter
wrote:

                    On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König
wrote:

                      Completely agree that this is complicated, but I still
don't
see the need
for it.

Drivers just need to use pm_runtime_get_if_in_use()
inside
the shrinker and
postpone all hw activity until resume.

                    Not good enough, at least long term I think. Also
postponing hw
activity
to resume doesn't solve the deadlock issue, if you still
need
to grab ttm
locks on resume.

                  Pondered this specific aspect some more, and I think you
still
have a race
here (even if you avoid the deadlock): If the condiditional
rpm_get call
fails there's no guarantee that the device will
suspend/resume
and clean
up the GART mapping.

                Well I think we have a major disconnect here. When the device
is
powered
down there is no GART mapping to clean up any more.

In other words GART is a table in local memory (VRAM) when
the
device is
powered down this table is completely destroyed. Any BO which
was
mapped
inside this table is now not mapped any more.

So when the shrinker wants to evict a BO which is marked as
mapped
to GART
and the device is powered down we just skip the GART
unmapping
part
because
that has already implicitly happened during power down.

Before mapping any BO into the GART again we power the GPU up
through the
runtime PM calls. And while powering it up again the GART is
restored.

              My point is that you can't tell whether the device will power
down or
not,
you can only tell whether there's a chance it might be powering
down
and
so you can't get at the rpm reference without deadlock issues.

                  The race gets a bit smaller if you use
pm_runtime_get_if_active(), but even then you might catch
it
right when
resume almost finished.

                What race are you talking about?

The worst thing which could happen is that we restore a GART
entry
which
isn't needed any more, but that is pretty much irrelevant
since
we
only
clear them to avoid some hw bugs.

              The race I'm seeing is where you thought the GART entry is not
issue,
tossed an object, but the device didn't suspend, so might still
use
it.

I guess if we're clearly separating the sw allocation of the
TTM_TT
with
the physical entries in the GART that should all work, but
feels
a
bit
tricky. The race I've seen is essentially these two getting out
of
sync.

So maybe it was me who's stuck.

What I wonder is whether it works in practice, since on the
restore
side
you need to get some locks to figure out which gart mappings
exist
and
need restoring. And that's the same locks as the shrinker needs
to
figure
out whether it might need to reap a gart mapping.

Or do you just copy the gart entries over and restore them
exactly
as-is,
so that there's no shared locks?

                  That means we'll have ttm bo hanging around with GART
allocations/mappings
which aren't actually valid anymore (since they might
escape
the
cleanup
upon resume due to the race). That doesn't feel like a
solid
design
either.

                I'm most likely missing something, but I'm really scratching
my
head where
you see a problem here.

              I guess one issue is that at least traditionally, igfx drivers
have
nested
runtime pm within dma_resv lock. And dgpu drivers the other way
round.
Which is a bit awkward if you're trying for common code.

Cheers, Sima