On Thu, 2024-10-10 at 10:00 +0200, Christian König wrote: > Am 09.10.24 um 16:17 schrieb Thomas Hellström: > > On Wed, 2024-10-09 at 15:39 +0200, Thomas Hellström wrote: > > > On Mon, 2024-10-07 at 11:08 +0200, Christian König wrote: > > > > Hi Thomas, > > > > > > > > I'm on sick leave, but I will try to answer questions and share > > > > some > > > > thoughts on this to unblock you. > > > > > > > > Am 18.09.24 um 14:57 schrieb Thomas Hellström: > > > > > Sima, Christian > > > > > > > > > > I've updated the shrinker series now with a guarded for_each > > > > > macro > > > > > instead: > > > > > > > > > > https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9 > > > > Yeah that looks like a big step in the right direction. > > > > > > > > > (Note I forgot to remove the export of the previous LRU > > > > > walker). > > > > > > > > > > so the midlayer argument is now not an issue anymore. The > > > > > cleanup.h > > > > > guard provides some additional protection against drivers > > > > > exiting > > > > > the > > > > > LRU loop early. > > > > > > > > > > So remaining is the question whether the driver is allowed to > > > > > discard a > > > > > suggested bo to shrink from TTM. > > > > > > > > > > Arguments for: > > > > > > > > > > 1) Not allowing that would require teaching TTM about > > > > > purgeable > > > > > objects. > > > > I think that is actually not correct. TTM already knows about > > > > purgeable > > > > objects. > > > > > > > > E.g. when TTM asks the driver what to do with evicted objects > > > > it is > > > > perfectly valid to return an empty placement list indicating > > > > that > > > > the > > > > backing store can just be thrown away. > > > > > > > > We use this for things like temporary buffers for example. > > > > > > > > That this doesn't apply to swapping looks like a design bug in > > > > the > > > > driver/TTM interface to me. > > > Yes we can do that with TTM, but for shrinking there's no point > > > in > > > trying to shrink when there is no swap-space left, other than > > > purgeable > > > object. The number of shrinkable objects returned in > > > shrinker::count > > > needs to reflect that, and *then* only those objects should be > > > selected > > > for shrinking. If we were to announce that to TTM using a > > > callback, > > > we're actually back to version 1 of this series which was > > > rejected by > > > you exactly since it was using callbacks a year or so ago???? > > Yeah that indeed doesn't sound like a good idea. > > On the other hand the decision that only purgeable objects should be > selected when there is no swap space left sounds like something TTM > should do and not the driver. > > > > > > 2) Devices who need the blitter during shrinking would want > > > > > to > > > > > punt > > > > > runtime_pm_get() to kswapd to avoid sleeping direct reclaim. > > > > I think the outcome of the discussion is that runtime PM should > > > > never > > > > be > > > > mixed into TTM swapping. > > > > > > > > You can power up blocks to enable a HW blitter for swapping, > > > > but > > > > this > > > > then can't be driven by the runtime PM framework. > > > Still that power-on might be sleeping, so what's the difference > > > using > > > runtime-pm or not? Why should the driver implement yet another > > > power > > > interface, just because TTM refuses to publish a sane LRU walk > > > interface? > > See the discussion with Sima around the PM functions. > > My understanding might be wrong, but I now think that with local > memory > you can't do the i915 approach where the PM functions are so low > level > that they can also be called inside the shrinker. > > So you basically have PM functions for your whole device and PM > functions for only the HW blocks necessary for the shrinker. > > > > > > > > > 3) If those devices end up blitting (LNL) to be able to > > > > > shrink, > > > > > they > > > > > would want to punt waiting for the fence to signal to kswapd > > > > > to > > > > > avoid > > > > > waiting in direct reclaim. > > > > Mhm, what do you mean with that? > > > When we fired the blitter from direct reclaim, we get a fence. If > > > we > > > wait for it in direct reclaim we will be sleeping waiting for > > > gpu, > > > which is bad form. We'd like return a failure so the object is > > > retried > > > when idle, or from kswapd. > > Oh, that is a really good point. So basically you want to avoid > dependencies on the dma_fence. > > > > > > 4) It looks like we need to resort to folio_trylock in the > > > > > shmem > > > > > backup > > > > > backend when shrinking is called for gfp_t = GFP_NOFS. A > > > > > failing > > > > > trylock will require a new bo. > > > > Why would a folio trylock succeed with one BO and not another? > > > Good point. We'd fail anyway but would perhaps need to call > > > SHRINK_STOP.. > > > > > > > And why would that trylock something the device specific driver > > > > should > > > > handle? > > > It happens in the TTM shrinker helper called from the driver in > > > the > > > spirit of demidlayering. > > > > > > > > Arguments against: > > > > > None really. I thought the idea of demidlayering would be to > > > > > allow > > > > > the > > > > > driver more freedom. > > > > Well that is a huge argument against it. Giving drivers more > > > > freedom > > > > is > > > > absolutely not something which turned out to be valuable in the > > > > past. > > > So then what's the point of demidlayering? > > So that drivers can prepare the environment for TTM to work with > instead > of TTM asking for it. > > In your case for example that means powering up HW blocks so that BOs > could be moved. > > > > > Instead we should put device drivers in a very strict corset of > > > > validated approaches to do things. > > > > > > > > Background is that in my experience driver developers are > > > > perfectly > > > > willing to do unclean approaches which only work in like 99% of > > > > all > > > > cases just to get a bit more performance or simpler driver > > > > implementation. > > > > > > > > Those approaches are not legal and in my opinion as subsystem > > > > maintainer > > > > I think we need to be more strict and push back much harder on > > > > stuff > > > > like that. > > > Still, historically that has made developers abandon common > > > components > > > for driver-specific solutions. > > > > > > And the question is still not answered. > > > > > > Exactly *why* can't the driver fail and continue traversing the > > > LRU, > > > because all our argumentation revolves around this and you have > > > yet > > > to > > > provide an objective reason why. > > And in the end, while I think there definitely things worthy of > > discussion in this series, > > > > I don't think there is a point in trying to land a LNL shrinker > > operating against a crippled TTM LRU walk interface, nor do I think > > anyone would want to attempt to port i915 over, and reworking it > > three > > times I'm now pretty familiar with what works and what doesn't. > > > > So question becomes, with proper reviews can I merge the series > > here as > > is, *with* the de-midlayered LRU walk and noting your advise > > against > > it, or not? > > More or less yes, I still have a bad feeling about it but we probably > need to see the whole thing getting used to judge if it really works > or not. > > I mean it's not UAPI we are talking about, so even if we find in > 5years > from now that it was a bad idea we can still roll it back and try > something else. > > So yeah, feel free to go ahead. Thanks, I'll respin a version moving the checks you pointed out, (get_nr_swap_pages() + all other checks TTM can really do) etc, into TTM helpers to simplify such a change in the future if it becomes needed. /Thomas > > Regards, > Christian. > > > > > > Thanks, > > Thomas > > > > > > > > > > > > > /Thomas > > > > > > > > > > > > > > > > Regards, > > > > Christian. > > > > > > > > > So any feedback appreciated. If that is found acceptable we > > > > > can > > > > > proceed > > > > > with reviewing this patch and also with the shrinker series. > > > > > > > > > > Thanks, > > > > > Thomas > > > > > > > > > > > > > > > On Mon, 2024-09-02 at 13:07 +0200, Daniel Vetter wrote: > > > > > > On Wed, Aug 28, 2024 at 02:20:34PM +0200, Christian König > > > > > > wrote: > > > > > > > Am 27.08.24 um 19:53 schrieb Daniel Vetter: > > > > > > > > On Tue, Aug 27, 2024 at 06:52:13PM +0200, Daniel Vetter > > > > > > > > wrote: > > > > > > > > > On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian > > > > > > > > > König > > > > > > > > > wrote: > > > > > > > > > > Completely agree that this is complicated, but I > > > > > > > > > > still > > > > > > > > > > don't > > > > > > > > > > see the need > > > > > > > > > > for it. > > > > > > > > > > > > > > > > > > > > Drivers just need to use pm_runtime_get_if_in_use() > > > > > > > > > > inside > > > > > > > > > > the shrinker and > > > > > > > > > > postpone all hw activity until resume. > > > > > > > > > Not good enough, at least long term I think. Also > > > > > > > > > postponing hw > > > > > > > > > activity > > > > > > > > > to resume doesn't solve the deadlock issue, if you > > > > > > > > > still > > > > > > > > > need > > > > > > > > > to grab ttm > > > > > > > > > locks on resume. > > > > > > > > Pondered this specific aspect some more, and I think > > > > > > > > you > > > > > > > > still > > > > > > > > have a race > > > > > > > > here (even if you avoid the deadlock): If the > > > > > > > > condiditional > > > > > > > > rpm_get call > > > > > > > > fails there's no guarantee that the device will > > > > > > > > suspend/resume > > > > > > > > and clean > > > > > > > > up the GART mapping. > > > > > > > Well I think we have a major disconnect here. When the > > > > > > > device > > > > > > > is > > > > > > > powered > > > > > > > down there is no GART mapping to clean up any more. > > > > > > > > > > > > > > In other words GART is a table in local memory (VRAM) > > > > > > > when > > > > > > > the > > > > > > > device is > > > > > > > powered down this table is completely destroyed. Any BO > > > > > > > which > > > > > > > was > > > > > > > mapped > > > > > > > inside this table is now not mapped any more. > > > > > > > > > > > > > > So when the shrinker wants to evict a BO which is marked > > > > > > > as > > > > > > > mapped > > > > > > > to GART > > > > > > > and the device is powered down we just skip the GART > > > > > > > unmapping > > > > > > > part > > > > > > > because > > > > > > > that has already implicitly happened during power down. > > > > > > > > > > > > > > Before mapping any BO into the GART again we power the > > > > > > > GPU up > > > > > > > through the > > > > > > > runtime PM calls. And while powering it up again the GART > > > > > > > is > > > > > > > restored. > > > > > > My point is that you can't tell whether the device will > > > > > > power > > > > > > down or > > > > > > not, > > > > > > you can only tell whether there's a chance it might be > > > > > > powering > > > > > > down > > > > > > and > > > > > > so you can't get at the rpm reference without deadlock > > > > > > issues. > > > > > > > > > > > > > > The race gets a bit smaller if you use > > > > > > > > pm_runtime_get_if_active(), but even then you might > > > > > > > > catch > > > > > > > > it > > > > > > > > right when > > > > > > > > resume almost finished. > > > > > > > What race are you talking about? > > > > > > > > > > > > > > The worst thing which could happen is that we restore a > > > > > > > GART > > > > > > > entry > > > > > > > which > > > > > > > isn't needed any more, but that is pretty much irrelevant > > > > > > > since > > > > > > > we > > > > > > > only > > > > > > > clear them to avoid some hw bugs. > > > > > > The race I'm seeing is where you thought the GART entry is > > > > > > not > > > > > > issue, > > > > > > tossed an object, but the device didn't suspend, so might > > > > > > still > > > > > > use > > > > > > it. > > > > > > > > > > > > I guess if we're clearly separating the sw allocation of > > > > > > the > > > > > > TTM_TT > > > > > > with > > > > > > the physical entries in the GART that should all work, but > > > > > > feels > > > > > > a > > > > > > bit > > > > > > tricky. The race I've seen is essentially these two getting > > > > > > out > > > > > > of > > > > > > sync. > > > > > > > > > > > > So maybe it was me who's stuck. > > > > > > > > > > > > What I wonder is whether it works in practice, since on the > > > > > > restore > > > > > > side > > > > > > you need to get some locks to figure out which gart > > > > > > mappings > > > > > > exist > > > > > > and > > > > > > need restoring. And that's the same locks as the shrinker > > > > > > needs > > > > > > to > > > > > > figure > > > > > > out whether it might need to reap a gart mapping. > > > > > > > > > > > > Or do you just copy the gart entries over and restore them > > > > > > exactly > > > > > > as-is, > > > > > > so that there's no shared locks? > > > > > > > > > > > > > > That means we'll have ttm bo hanging around with GART > > > > > > > > allocations/mappings > > > > > > > > which aren't actually valid anymore (since they might > > > > > > > > escape > > > > > > > > the > > > > > > > > cleanup > > > > > > > > upon resume due to the race). That doesn't feel like a > > > > > > > > solid > > > > > > > > design > > > > > > > > either. > > > > > > > I'm most likely missing something, but I'm really > > > > > > > scratching > > > > > > > my > > > > > > > head where > > > > > > > you see a problem here. > > > > > > I guess one issue is that at least traditionally, igfx > > > > > > drivers > > > > > > have > > > > > > nested > > > > > > runtime pm within dma_resv lock. And dgpu drivers the other > > > > > > way > > > > > > round. > > > > > > Which is a bit awkward if you're trying for common code. > > > > > > > > > > > > Cheers, Sima