Re: [Intel-gfx] [PATCH] drm/i915: Ditch i915 globals shrink infrastructure

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Thu, 22 Jul 2021 15:02:32 +0100

On 22/07/2021 14:37, Jason Ekstrand wrote:
On Thu, Jul 22, 2021 at 5:34 AM Tvrtko Ursulin
<tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
On 22/07/2021 11:16, Daniel Vetter wrote:
On Thu, Jul 22, 2021 at 11:02:55AM +0100, Tvrtko Ursulin wrote:
On 21/07/2021 19:32, Daniel Vetter wrote:
This essentially reverts

commit 84a1074920523430f9dc30ff907f4801b4820072
Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Date:   Wed Jan 24 11:36:08 2018 +0000

       drm/i915: Shrink the GEM kmem_caches upon idling

mm/vmscan.c:do_shrink_slab() is a thing, if there's an issue with it
then we need to fix that there, not hand-roll our own slab shrinking
code in i915.

This is somewhat incomplete statement which ignores a couple of angles so I
wish there was a bit more time to respond before steam rolling it in. :(

The removed code was not a hand rolled shrinker, but about managing slab
sizes in face of bursty workloads. Core code does not know when i915 is
active and when it is idle, so calling kmem_cache_shrink() after going idle
wass supposed to help with house keeping by doing house keeping work outside
of the latency sensitive phase.

To "fix" (improve really) it in core as you suggest, would need some method
of signaling when a slab user feels is an opportunte moment to do this house
keeping. And kmem_cache_shrink is just that so I don't see the problem.

Granted, argument kmem_cache_shrink is not much used is a valid one so
discussion overall is definitely valid. Becuase on the higher level we could
definitely talk about which workloads actually benefit from this code and
how much which probably no one knows at this point.

Pardon me for being a bit curt here, but that discussion should have
happened 3.5 years ago when this landed.  The entire justification we
have on record for this change is, "When we finally decide the gpu is
idle, that is a good time to shrink our kmem_caches."  We have no
record of any workloads which benefit from this and no recorded way to
reproduce any supposed benefits, even if it requires a microbenchmark.
But we added over 100 lines of code for it anyway, including a bunch
of hand-rolled RCU juggling.  Ripping out unjustified complexity is
almost always justified, IMO.  The burden of proof here isn't on
Daniel to show he isn't regressing anything but it was on you and
Chris to show that complexity was worth something back in 2018 when
this landed.

It feels like there is so much knee-jerk when looking at code added by 
Chris which often results in not reading properly what I wrote.

For instance I did not ask for any proof of no regressions, neither I 
claimed any regressions. In fact I said clearly that at this point it is 
not known what benefited from it. Statement at the time wasn't clear so 
you would need to ask Chris whether he remembers any better than what I 
can find in mailing list archives. Heck I even said the argument to 
remove is completely valid..

Point is, process used to be more inclusive and IMO there is no 
technical justification to fast track this type of change. Compared to 
other work in progress there was approaching zero maintenance cost with 
this.

Besides, mm folks may still say that it is good hygiene to tidy own 
slabs at opportune moments. Maybe it is a stretch but we don't know if 
we don't ask. There are certainly online references to slab reclaim 
being problematic in the past. There was nothing urgent in this "revert" 
which couldn't have waited a bit longer, or at least _some_ of the 
involved people copied.

Regards,

Tvrtko