On 22/07/2021 14:37, Jason Ekstrand wrote:
On Thu, Jul 22, 2021 at 5:34 AM Tvrtko Ursulin
<tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
On 22/07/2021 11:16, Daniel Vetter wrote:
On Thu, Jul 22, 2021 at 11:02:55AM +0100, Tvrtko Ursulin wrote:
On 21/07/2021 19:32, Daniel Vetter wrote:
This essentially reverts
commit 84a1074920523430f9dc30ff907f4801b4820072
Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Date: Wed Jan 24 11:36:08 2018 +0000
drm/i915: Shrink the GEM kmem_caches upon idling
mm/vmscan.c:do_shrink_slab() is a thing, if there's an issue with it
then we need to fix that there, not hand-roll our own slab shrinking
code in i915.
This is somewhat incomplete statement which ignores a couple of angles so I
wish there was a bit more time to respond before steam rolling it in. :(
The removed code was not a hand rolled shrinker, but about managing slab
sizes in face of bursty workloads. Core code does not know when i915 is
active and when it is idle, so calling kmem_cache_shrink() after going idle
wass supposed to help with house keeping by doing house keeping work outside
of the latency sensitive phase.
To "fix" (improve really) it in core as you suggest, would need some method
of signaling when a slab user feels is an opportunte moment to do this house
keeping. And kmem_cache_shrink is just that so I don't see the problem.
Granted, argument kmem_cache_shrink is not much used is a valid one so
discussion overall is definitely valid. Becuase on the higher level we could
definitely talk about which workloads actually benefit from this code and
how much which probably no one knows at this point.
Pardon me for being a bit curt here, but that discussion should have
happened 3.5 years ago when this landed. The entire justification we
have on record for this change is, "When we finally decide the gpu is
idle, that is a good time to shrink our kmem_caches." We have no
record of any workloads which benefit from this and no recorded way to
reproduce any supposed benefits, even if it requires a microbenchmark.
But we added over 100 lines of code for it anyway, including a bunch
of hand-rolled RCU juggling. Ripping out unjustified complexity is
almost always justified, IMO. The burden of proof here isn't on
Daniel to show he isn't regressing anything but it was on you and
Chris to show that complexity was worth something back in 2018 when
this landed.
It feels like there is so much knee-jerk when looking at code added by
Chris which often results in not reading properly what I wrote.
For instance I did not ask for any proof of no regressions, neither I
claimed any regressions. In fact I said clearly that at this point it is
not known what benefited from it. Statement at the time wasn't clear so
you would need to ask Chris whether he remembers any better than what I
can find in mailing list archives. Heck I even said the argument to
remove is completely valid..
Point is, process used to be more inclusive and IMO there is no
technical justification to fast track this type of change. Compared to
other work in progress there was approaching zero maintenance cost with
this.
Besides, mm folks may still say that it is good hygiene to tidy own
slabs at opportune moments. Maybe it is a stretch but we don't know if
we don't ask. There are certainly online references to slab reclaim
being problematic in the past. There was nothing urgent in this "revert"
which couldn't have waited a bit longer, or at least _some_ of the
involved people copied.
Regards,
Tvrtko