Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Christian König <christian.koenig@xxxxxxx> · Wed, 2 Oct 2024 13:32:08 +0200

Ah, yes sorry totally forgotten about that.

Give me till Friday to swap everything back into my head again.

Christian.

Am 02.10.24 um 13:30 schrieb Thomas Hellström:
Hi, Christian,

Ping? Can i get an ack to proceed with this?

Thanks,
Thomas

On Wed, 2024-09-18 at 14:57 +0200, Thomas Hellström wrote:
Sima, Christian

I've updated the shrinker series now with a guarded for_each macro
instead:

https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9

(Note I forgot to remove the export of the previous LRU walker).

  so the midlayer argument is now not an issue anymore. The cleanup.h
guard provides some additional protection against drivers exiting the
LRU loop early.

So remaining is the question whether the driver is allowed to discard
a
suggested bo to shrink from TTM.

Arguments for:

1) Not allowing that would require teaching TTM about purgeable
objects.
2) Devices who need the blitter during shrinking would want to punt
runtime_pm_get() to kswapd to avoid sleeping direct reclaim.
3) If those devices end up blitting (LNL) to be able to shrink, they
would want to punt waiting for the fence to signal to kswapd to avoid
waiting in direct reclaim.
4) It looks like we need to resort to folio_trylock in the shmem
backup
backend when shrinking is called for gfp_t = GFP_NOFS. A failing
trylock will require a new bo.

Arguments against:
None really. I thought the idea of demidlayering would be to allow
the
driver more freedom.

So any feedback appreciated. If that is found acceptable we can
proceed
with reviewing this patch and also with the shrinker series.

Thanks,
Thomas

On Mon, 2024-09-02 at 13:07 +0200, Daniel Vetter wrote:
On Wed, Aug 28, 2024 at 02:20:34PM +0200, Christian König wrote:
Am 27.08.24 um 19:53 schrieb Daniel Vetter:
On Tue, Aug 27, 2024 at 06:52:13PM +0200, Daniel Vetter wrote:
On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König
wrote:
Completely agree that this is complicated, but I still
don't
see the need
for it.

Drivers just need to use pm_runtime_get_if_in_use() inside
the shrinker and
postpone all hw activity until resume.
Not good enough, at least long term I think. Also postponing
hw
activity
to resume doesn't solve the deadlock issue, if you still need
to grab ttm
locks on resume.
Pondered this specific aspect some more, and I think you still
have a race
here (even if you avoid the deadlock): If the condiditional
rpm_get call
fails there's no guarantee that the device will suspend/resume
and clean
up the GART mapping.
Well I think we have a major disconnect here. When the device is
powered
down there is no GART mapping to clean up any more.

In other words GART is a table in local memory (VRAM) when the
device is
powered down this table is completely destroyed. Any BO which was
mapped
inside this table is now not mapped any more.

So when the shrinker wants to evict a BO which is marked as
mapped
to GART
and the device is powered down we just skip the GART unmapping
part
because
that has already implicitly happened during power down.

Before mapping any BO into the GART again we power the GPU up
through the
runtime PM calls. And while powering it up again the GART is
restored.
My point is that you can't tell whether the device will power down
or
not,
you can only tell whether there's a chance it might be powering
down
and
so you can't get at the rpm reference without deadlock issues.

The race gets a bit smaller if you use
pm_runtime_get_if_active(), but even then you might catch it
right when
resume almost finished.
What race are you talking about?

The worst thing which could happen is that we restore a GART
entry
which
isn't needed any more, but that is pretty much irrelevant since
we
only
clear them to avoid some hw bugs.
The race I'm seeing is where you thought the GART entry is not
issue,
tossed an object, but the device didn't suspend, so might still use
it.

I guess if we're clearly separating the sw allocation of the TTM_TT
with
the physical entries in the GART that should all work, but feels a
bit
tricky. The race I've seen is essentially these two getting out of
sync.

So maybe it was me who's stuck.

What I wonder is whether it works in practice, since on the restore
side
you need to get some locks to figure out which gart mappings exist
and
need restoring. And that's the same locks as the shrinker needs to
figure
out whether it might need to reap a gart mapping.

Or do you just copy the gart entries over and restore them exactly
as-is,
so that there's no shared locks?

That means we'll have ttm bo hanging around with GART
allocations/mappings
which aren't actually valid anymore (since they might escape
the
cleanup
upon resume due to the race). That doesn't feel like a solid
design
either.
I'm most likely missing something, but I'm really scratching my
head where
you see a problem here.
I guess one issue is that at least traditionally, igfx drivers have
nested
runtime pm within dma_resv lock. And dgpu drivers the other way
round.
Which is a bit awkward if you're trying for common code.

Cheers, Sima