Re: [PATCH 4/7] drm/ttm: move LRU walk defines into new internal header

Christian König <christian.koenig@xxxxxxx> · Thu, 22 Aug 2024 11:29:24 +0200

    Am 22.08.24 um 10:21 schrieb Thomas Hellström:

      On Thu, 2024-08-22 at 09:55 +0200, Christian König wrote:

        Am 22.08.24 um 08:47 schrieb Thomas Hellström:

                As Sima said, this is complicated but not beyond
comprehension:
i915
https://elixir.bootlin.com/linux/v6.11-rc4/source/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c#L317

              As far as I can tell what i915 does here is extremely
questionable.

      if (sc->nr_scanned < sc->nr_to_scan &&
current_is_kswapd()) {
....
          with_intel_runtime_pm(&i915->runtime_pm, wakeref) {

with_intel_runtime_pm() then calls pm_runtime_get_sync().

So basically the i915 shrinker assumes that when called from
kswapd()
that it can synchronously wait for runtime PM to power up the
device
again.

As far as I can tell that means that a device driver makes
strong
and
completely undocumented assumptions how kswapd works
internally.

            Admittedly that looks weird

But I'd really expect a reclaim lockdep splat to happen there if
the
i915 pm did something not-allowed. IIRC, the design direction the
i915
people got from mm people regarding the shrinkers was to avoid
any
sleeps in direct reclaim and punt it to kswapd. Need to ask i915
people
how they can get away with that.

          So it turns out that Xe integrated pm resume is reclaim-safe, and
I'd
expect i915's to be as well. Xe discrete pm resume isn't.

So that means that, at least for integrated, the i915 shrinker
should
be ok from that POW, and punting certain bos to kswapd is not
AFAICT
abusing any undocumented features of kswapd but rather a way to
avoid
resuming the device during direct reclaim, like documented.

        The more I think about this the more I disagree to this driver
design. 
In my opinion device drivers should *never* resume runtime PM in a 
shrinker callback in the first place.

      Runtime PM resume is allowed even from irq context if carefully
implemented by the driver and flagged as such to the core. 

https://docs.kernel.org/power/runtime_pm.html

Resuming runtime PM from reclaim therefore shouldn't be an issue IMO,
and really up to the driver.

    Mhm when it's up to the driver on which level to use runtime PM then
    that at least explains why the framework doesn't have lockdep
    annotations. 

    Ok, that is at least convincing the what i915 does here should work
    somehow.

        When the device is turned off it means that all of it's operations
are 
stopped and eventually power to caches etc turned off as well. So I 
don't see any ongoing writeback operations or similar either.

So the question is why do we need to power on the device in a
shrinker 
in the first place?

What could be is that the device needs to flush GART TLBs or similar 
when it is turned on, e.g. that you grab a PM reference to make sure 
that during your HW operation the device doesn't suspend.

      Exactly why the i915 needs to flush the GART I'm not sure of but I
suspect the gart TLB might be forgotten while suspended.

    Well that is unproblematic. Amdgpu and I think nouveau does
    something similar.

    But you don't need to resume the hardware for this, just grabbing
    the reference to make sure that it doesn't suspend is sufficient.

    The assumption I make here is that you don't need to do anything
    when the hardware is power down anyway. That seems to be true for at
    least the hardware designs I'm aware of.

        But that doesn't mean that you should resume the device. In other
words 
when the device is powered down you shouldn't power it up again.

And for GART we already have the necessary move callback implemented
in 
TTM. This is done by radeon, amdgpu and nouveu in a common way as far
as 
I can see.

So why should Xe be special and follow the very questionable approach
of 
i915 here?

      For Xe, Lunar Lake (integrated) has the interesting design that each bo
carries compression metadata that needs to be blitted to system pages
during shrinking. The alternative is to resolve all buffer objects at
device runtime suspend...

    That's the same for amdgpu as well, but when the device is powered
    down those compression data needs to be evacuated anyway.

      But runtime PM aside, with a one-bo only approach we still have the
drawbacks that it 

* eliminates possibility for driver deadlock avoidance
* Requires TTM knowledge of "purgeable" bos
* Requires an additional LRU to avoid O(n2) traversal of already
shrunken objects
* Drivers with legitimate shrinker designs that don't fit in the TTM-
enforced model will have frustrated maintainers.

    I still find that only halve-convincing. The real question is if
    it's a good idea to give drivers the power to decide what to shrink
    and what not to shrink.

    And at least with the arguments and experience at hand I would vote
    for not doing that. We have added the eviction_valuable callback for
    amdgpu and ended up in quite a mess with that.

    Background is that some eviction decision done by the driver where
    not as optimal as we hoped it to be.

    On the other hand keeping track of all the swapped out objects
    should be TTMs job anyway, e.g. having a TTM_PL_SWAPPED domain.

    So in my mind the ideal solution still looks like this:

    driver_specific_shrinker_scan(...)

    {

        driver_specific_preparations(...);

        bo = ttm_reserve_next_bo_to_shrink(...);

        ttm_bo_validate(bo, TTM_PL_SWAPPED);

        ttm_bo_unreserver(bo);

        driver_specific_cleanups(...);

    }

    When there is a potential deadlock because the shrinker might be
    called from driver code which holds locks the driver needs to it's
    specific preparation or cleanup then those would apply to all BOs
    and not just the one returned from TTM.

    The only use case I can see were the driver would need to filter out
    the BOs to shrink would be if TTM doesn't know about all the
    information to make a decision what to shrink and exactly that is
    what I try to avoid.

    Regards,

    Christian.

Thanks,
Thomas

        Regards,
Christian.

          /Thomas