Quoting Tvrtko Ursulin (2019-02-04 18:48:50)
>
> On 04/02/2019 13:22, Chris Wilson wrote:
> > -int __init i915_global_active_init(void)
> > +int i915_global_active_init(void)
>
> These can't remain __init, since they are only called from the global
> __init one?

I ran into problems, and removed __init until it stopped complaining and
I stopped caring.

> > @@ -2916,12 +2917,11 @@ static void shrink_caches(struct drm_i915_private *i915)
> >  	 * filled slabs to prioritise allocating from the mostly full slabs,
> >  	 * with the aim of reducing fragmentation.
> >  	 */
> > -	kmem_cache_shrink(i915->priorities);
> > -	kmem_cache_shrink(i915->dependencies);
> > -	kmem_cache_shrink(i915->requests);
> >  	kmem_cache_shrink(i915->luts);
> >  	kmem_cache_shrink(i915->vmas);
> >  	kmem_cache_shrink(i915->objects);
> > +
> > +	i915_globals_shrink();
>
> This is the main bit which worries me.
>
> Global caches are what we want I think, exactly for what you wrote in
> the commit message. But would one device going idle have the potential
> to inject some latency into another potentially very busy client?
>
> Perhaps we could have some sort of aggregated idle signal and defer
> shrinking caches to that point. Like a bitmask of global clients
> reporting their idle/active status to global core, and then shrink
> happens only if all are idle.

I had mixed feelings too. I didn't want to completely discard the
current logic, but this should be shrinking only when idle across all
future stakeholders... or we demonstrate that shrinking has no effect on
concurrent allocation latency. An active counter for unparking seems an
easy way out. (Which today is this imaginary 1-bit counter.)

What I thought helped save this is that it is done from a post-RCU
worker, so the system has to be pretty stable before we start shrinking;
we only clash with the first user to wake up.

Inside the caller, kmem_cache_shrink() loops over each cpu, removing the
local cpu cache and then shrinking the global slab cache.
That is clearly going to increase latency for a concurrent caller.

> > +int i915_global_scheduler_init(void)
> > +{
> > +	global.slab_dependencies = KMEM_CACHE(i915_dependency,
> > +					      SLAB_HWCACHE_ALIGN);
> > +	if (!global.slab_dependencies)
> > +		return -ENOMEM;
>
> Right, so this slab is duplicated. It could end up merged by the core,
> but I am thinking if this is the direction we want to go just to avoid
> some pointer chasing.

"some pointer chasing" :)

The slab isn't necessarily duplicated; that depends on compile-time
policy. In debug environments, or those more sensitive than performance,
it will be private so that we can catch stray writes and whatnot.

add/remove: 11/0 grow/shrink: 11/20 up/down: 595/-668 (-73)
Function                                     old     new   delta
i915_global_request_init                       -     116    +116
i915_global_scheduler_init                     -     111    +111
igt_mock_ppgtt_misaligned_dma                679     748     +69
i915_globals_init                              -      53     +53
global                                         8      40     +32
i915_global_scheduler_shrink                   -      29     +29
i915_global_scheduler_exit                     -      29     +29
i915_global_request_shrink                     -      29     +29
i915_global_request_exit                       -      29     +29
i915_globals_shrink                            -      20     +20
__i915_priolist_free                           -      20     +20
i915_global_active_shrink                      -      17     +17
i915_globals_exit                              -      15     +15
live_suppress_wait_preempt.part.cold         202     211      +9
__err_print_to_sgl                          4175    4181      +6
i915_global_active_exit                       12      17      +5
intel_engine_lookup_user                      54      55      +1
init_module                                   88      89      +1
igt_mock_ppgtt_misaligned_dma.cold           246     247      +1
i915_init                                     88      89      +1
gen11_irq_handler                            733     734      +1
g4x_pre_enable_dp                            345     346      +1
ring_request_alloc                          1899    1898      -1
live_suppress_wait_preempt.part             1291    1290      -1
i915_sched_lookup_priolist                   482     479      -3
i915_request_retire                         1377    1373      -4
i915_request_await_dma_fence                 547     543      -4
i915_fence_release                            45      41      -4
__execlists_submission_tasklet              2121    2111     -10
i915_request_alloc                           817     806     -11
i915_request_alloc_slow.isra                  76      64     -12
i915_sched_node_add_dependency               114     101     -13
execlists_cancel_requests                    690     676     -14
i915_sched_node_fini                         459     444     -15
guc_submission_tasklet                      1931    1916     -15
__sleep_work                                 106      75     -31
mock_device_release                          407     371     -36
igt_mock_ppgtt_huge_fill                    1108    1069     -39
i915_gem_cleanup_early                       213     173     -40
igt_mock_ppgtt_huge_fill.cold                611     531     -80
mock_gem_device                             1263    1102    -161
i915_gem_init_early                          838     664    -174

__i915_priolist_free is the nasty one, but that's only hit for
!I915_PRIORITY_NORMAL, so I considered it not worth inlining.

(I have no idea what the compiler thinks changed in half of those
functions.)

> You wouldn't consider i915->global->slab_dependencies or something
> along those lines?

I think for bits and bobs that are true globals, like allocators, global
variables do make sense. We either end up with pointers to a singleton,
or just link them directly into the code. And I like the idea of easy
wins.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx