On 7/2/21 8:29 PM, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complaints from lockdep or the like, which is good. Then I
> did this with RT enabled (and no debug):

Thanks for testing!

> - A "time make -j32" run of allmodconfig on /dev/shm.
>   Old:
>   | real 20m6,217s
>   | user 568m22,553s
>   | sys  48m33,126s
>
>   New:
>   | real 20m9,049s
>   | user 569m32,096s
>   | sys  48m47,670s
>
>   These 3 seconds here are probably in the noise range.
>
> - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
>   Old:
>   | 464.967,20 msec task-clock # 27,220 CPUs utilized ( +- 0,16% )
>   | 7.683.944 context-switches # 0,017 M/sec ( +- 0,86% )
>   | 931.380 cpu-migrations # 0,002 M/sec ( +- 4,94% )
>   | 219.569 page-faults # 0,472 K/sec ( +- 0,39% )
>   | 1.104.727.599.918 cycles # 2,376 GHz ( +- 0,18% )
>   | 941.428.898.087 stalled-cycles-frontend # 85,22% frontend cycles idle ( +- 0,24% )
>   | 729.016.546.572 stalled-cycles-backend # 65,99% backend cycles idle ( +- 0,32% )
>   | 340.133.571.519 instructions # 0,31 insn per cycle
>   |                             # 2,77 stalled cycles per insn ( +- 0,12% )
>   | 73.746.821.314 branches # 158,607 M/sec ( +- 0,13% )
>   | 377.838.006 branch-misses # 0,51% of all branches ( +- 1,01% )
>   |
>   | 17,0820 +- 0,0202 seconds time elapsed ( +- 0,12% )
>
>   New:
>   | 422.865,71 msec task-clock # 4,782 CPUs utilized ( +- 0,34% )
>   | 14.594.238 context-switches # 0,035 M/sec ( +- 0,43% )
>   | 3.737.926 cpu-migrations # 0,009 M/sec ( +- 0,46% )
>   | 218.474 page-faults # 0,517 K/sec ( +- 0,74% )
>   | 940.715.812.020 cycles # 2,225 GHz ( +- 0,34% )
>   | 716.593.827.820 stalled-cycles-frontend # 76,18% frontend cycles idle ( +- 0,39% )
>   | 550.730.862.839 stalled-cycles-backend # 58,54% backend cycles idle ( +- 0,43% )
>   | 417.274.588.907 instructions # 0,44 insn per cycle
>   |                             # 1,72 stalled cycles per insn ( +- 0,17% )
>   | 92.814.150.290 branches # 219,488 M/sec ( +- 0,17% )
>   | 822.102.170 branch-misses # 0,89% of all branches ( +- 0,41% )
>   |
>   | 88,427 +- 0,618 seconds time elapsed ( +- 0,70% )
>
> So this is outside of the noise range.
> I'm not sure where this is coming from. My guess would be higher lock
> contention within the memory allocator.

The series shouldn't significantly change the interaction with the memory
allocator, though. It seems there are fewer cycles but more time elapsed, thus
more sleeping - is it locks becoming mutexes on RT?

My first guess - the last patch, the local_lock conversion. What would happen
if you took that one out? The series should still be RT-compatible without it.
If that improves things a lot, maybe the conversion to local_lock is not worth
it.

My second guess - list_lock remains a spinlock with my series, thus an RT
mutex, whereas the current RT tree converts it to a raw_spinlock (a rough
sketch of that distinction is appended at the end of this mail). I'd hope that
leaving it as a non-raw spinlock would still be much better for RT goals, even
if hackbench (which is AFAIK very slab intensive) throughput regresses -
hopefully not by that much.

>> The remaining patches to upstream from the RT tree are small ones related to
>> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
>> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
>> perhaps be re-evaluated as the series also addresses some latency issues with
>> percpu partial slabs.
>
> With that series the PARTIAL slab can indeed be enabled. I have (had) a
> half-done series where I had PARTIAL enabled and noticed a slight
> increase in latency, so I made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of the noise.
>
> Sebastian
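
Appended, as referenced above: an illustrative sketch only (not code from the
series, and the lock names are made up) of the spinlock_t vs raw_spinlock_t
distinction behind the second guess. Under PREEMPT_RT a spinlock_t is
substituted by a sleeping, rt_mutex-based lock, while a raw_spinlock_t keeps
truly spinning with preemption (and, for the _irqsave variants, interrupts)
disabled.

/* Illustrative sketch only - hypothetical locks, not from the series. */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_list_lock);    /* on PREEMPT_RT: sleeping, rt_mutex-based */
static DEFINE_RAW_SPINLOCK(demo_raw_lock); /* on PREEMPT_RT: still a true spinning lock */

static void demo_critical_sections(void)
{
	unsigned long flags;

	spin_lock_irqsave(&demo_list_lock, flags);
	/* On RT the task may sleep here when contended; hard interrupts stay enabled. */
	spin_unlock_irqrestore(&demo_list_lock, flags);

	raw_spin_lock_irqsave(&demo_raw_lock, flags);
	/* On RT this still disables interrupts and preemption, and spins when contended. */
	raw_spin_unlock_irqrestore(&demo_raw_lock, flags);
}

That is why converting list_lock to raw_spinlock helps throughput under
contention but hurts worst-case latency, which is the trade-off discussed
above.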