Quoting Tvrtko Ursulin (2019-02-04 18:48:50)
>
> On 04/02/2019 13:22, Chris Wilson wrote:
> > -int __init i915_global_active_init(void)
> > +int i915_global_active_init(void)
>
> These can't remain __init, since they are only called from the global
> __init one?

I ran into problems, and removed __init until it stopped complaining and
I stopped caring.

> > @@ -2916,12 +2917,11 @@ static void shrink_caches(struct drm_i915_private *i915)
> >  	 * filled slabs to prioritise allocating from the mostly full slabs,
> >  	 * with the aim of reducing fragmentation.
> >  	 */
> > -	kmem_cache_shrink(i915->priorities);
> > -	kmem_cache_shrink(i915->dependencies);
> > -	kmem_cache_shrink(i915->requests);
> >  	kmem_cache_shrink(i915->luts);
> >  	kmem_cache_shrink(i915->vmas);
> >  	kmem_cache_shrink(i915->objects);
> > +
> > +	i915_globals_shrink();
>
> This is the main bit which worries me.
>
> Global caches are what we want I think, exactly for what you wrote in
> the commit message. But would one device going idle have the potential
> to inject some latency into another potentially very busy client?
>
> Perhaps we could have some sort of aggregated idle signal and defer
> shrinking caches to that point. Like a bitmask of global clients
> reporting their idle/active status to global core, and then shrink
> happens only if all are idle.

I had mixed feelings too. I didn't want to completely discard the
current logic, but this should be shrinking only when idle across all
future stakeholders... or we demonstrate that shrinking has no effect on
concurrent allocation latency. An active counter for unparking seems an
easy way out. (Which today is this imaginary 1-bit counter.)

What I thought helped save this is that it is done from a post-RCU
worker, so the system has to be pretty stable before we start shrinking;
we only clash with the first user to wake up.

Inside the caller, kmem_cache_shrink() loops over each cpu, removing the
local cpu cache and then shrinking the global slab cache.
That is clearly going to increase latency for a concurrent caller.

> > +int i915_global_scheduler_init(void)
> > +{
> > +	global.slab_dependencies = KMEM_CACHE(i915_dependency,
> > +					      SLAB_HWCACHE_ALIGN);
> > +	if (!global.slab_dependencies)
> > +		return -ENOMEM;
>
> Right, so this slab is duplicated. It could end up merged by the core,
> but I am thinking if this is the direction we want to go just to avoid
> some pointer chasing.

"some pointer chasing" :)

The slab isn't necessarily duplicated; that depends on compile-time
policy. In debug environments, or those more sensitive than performance,
it will be private so that we can catch stray writes and whatnot.

add/remove: 11/0 grow/shrink: 11/20 up/down: 595/-668 (-73)
Function                                     old     new   delta
i915_global_request_init                       -     116    +116
i915_global_scheduler_init                     -     111    +111
igt_mock_ppgtt_misaligned_dma                679     748     +69
i915_globals_init                              -      53     +53
global                                         8      40     +32
i915_global_scheduler_shrink                   -      29     +29
i915_global_scheduler_exit                     -      29     +29
i915_global_request_shrink                     -      29     +29
i915_global_request_exit                       -      29     +29
i915_globals_shrink                            -      20     +20
__i915_priolist_free                           -      20     +20
i915_global_active_shrink                      -      17     +17
i915_globals_exit                              -      15     +15
live_suppress_wait_preempt.part.cold         202     211      +9
__err_print_to_sgl                          4175    4181      +6
i915_global_active_exit                       12      17      +5
intel_engine_lookup_user                      54      55      +1
init_module                                   88      89      +1
igt_mock_ppgtt_misaligned_dma.cold           246     247      +1
i915_init                                     88      89      +1
gen11_irq_handler                            733     734      +1
g4x_pre_enable_dp                            345     346      +1
ring_request_alloc                          1899    1898      -1
live_suppress_wait_preempt.part             1291    1290      -1
i915_sched_lookup_priolist                   482     479      -3
i915_request_retire                         1377    1373      -4
i915_request_await_dma_fence                 547     543      -4
i915_fence_release                            45      41      -4
__execlists_submission_tasklet              2121    2111     -10
i915_request_alloc                           817     806     -11
i915_request_alloc_slow.isra                  76      64     -12
i915_sched_node_add_dependency               114     101     -13
execlists_cancel_requests                    690     676     -14
i915_sched_node_fini                         459     444     -15
guc_submission_tasklet                      1931    1916     -15
__sleep_work                                 106      75     -31
mock_device_release                          407     371     -36
igt_mock_ppgtt_huge_fill                    1108    1069     -39
i915_gem_cleanup_early                       213     173     -40
igt_mock_ppgtt_huge_fill.cold                611     531     -80
mock_gem_device                             1263    1102    -161
i915_gem_init_early                          838     664    -174

__i915_priolist_free is the nasty one, but that's only hit for
!I915_PRIORITY_NORMAL, so I considered it not worth inlining.

(I have no idea what the compiler thinks changed in half of those
functions.)

> You wouldn't consider i915->global->slab_dependencies or something
> along those lines?

I think for bits and bobs that are true globals, like allocators, global
variables do make sense. We either end up with pointers to a singleton,
or just link them directly into the code. And I like the idea of easy
wins.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx