Re: Mainline kernel OLTP performance update

Nick Piggin <nickpiggin@xxxxxxxxxxxx> · Fri, 16 Jan 2009 19:59:29 +1100

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <nickpiggin@xxxxxxxxxxxx> 
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Here are some more performance numbers with "slub_test" kernel module.
It's basically a really tiny microbenchmark, so I don't really consider
it gives too useful results, except it does show up some problems in
SLAB's scalability  that may start to bite as we continue to get more
threads per socket.

(I ran a few of these tests on one of Dave's 2 socket, 128 thread
systems, and slab gets really painful... these kinds of thread counts
may only be a couple of years away from x86).

All numbers are in CPU cycles.

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate 10000 objs then free them
obj size  SLAB       SLQB      SLUB
8           77+ 128   69+ 47   61+ 77
16          69+ 104  116+ 70   77+ 80
32          66+ 101   82+ 81   71+ 89
64          82+ 116   95+ 81   94+105
128        100+ 148  106+ 94  114+163
256        153+ 136  134+ 98  124+186
512        209+ 161  170+186  134+276
1024       331+ 249  236+245  134+283
2048       608+ 443  380+386  172+312
4096      1109+ 624  678+661  239+372
8192      1166+1077  767+683  535+433
16384     1213+1160  914+731  577+682

We can see SLAB has a fair bit more overhead in this case. SLUB starts
doing higher order allocations I think around size 256, which reduces
costs there. Don't know what the SLQB artifact at 16 is caused by...

2. Kmalloc: alloc/free test (repeatedly allocate and free)
       SLAB  SLQB  SLUB
8       98   90     94
16      98   90     93
32      98   90     93
64      99   90     94
128    100   92     93
256    104   93     95
512    105   94     97
1024   106   93     97
2048   107   95     95
4096   111   92     97
8192   111   94    631
16384  114   92    741

Here we see SLUB's allocator passthrough (or is the the lack of queueing?).
Straight line speed at small sizes is probably due to instructions in the
fastpaths. It's pretty meaningless though because it probably changes if
there is any actual load on the CPU, or another CPU architecture. Doesn't
look bad for SLQB though :)

Concurrent allocs
=================
1. Like the first single thread test, lots of allocs, then lots of frees.
But running on all CPUs. Average over all CPUs.
       SLAB        SLQB         SLUB
8        251+ 322    73+  47   65+  76
16       240+ 331    84+  53   67+  82
32       235+ 316    94+  57   77+  92
64       338+ 303   120+  66  105+ 136
128      549+ 355   139+ 166  127+ 344
256     1129+ 456   189+ 178  236+ 404
512     2085+ 872   240+ 217  244+ 419
1024    3895+1373   347+ 333  251+ 440
2048    7725+2579   616+ 695  373+ 588
4096   15320+4534  1245+1442  689+1002

A problem with SLAB scalability starts showing up on this system with only
4 threads per socket. Again, SLUB sees a benefit from higher order
allocations.

2. Same as 2nd single threaded test, alloc then free, on all CPUs.
      SLAB  SLQB  SLUB
8      99   90    93
16     99   90    93
32     99   90    93
64    100   91    94
128   102   90    93
256   105   94    97
512   106   93    97
1024  108   93    97
2048  109   93    96
4096  110   93    96

No surprises. Objects always fit in queues (or unqueues, in the case of
SLUB), so there is no cross cache traffic.

Remote free test
================
1. Allocate N objects on CPUs 1-7, then free them all from CPU 0. Average cost
   of all kmalloc+kfree
      SLAB        SLQB     SLUB
8       191+ 142   53+ 64  56+99
16      180+ 141   82+ 69  60+117
32      173+ 142  100+ 71  78+151
64      240+ 147  131+ 73  117+216
128     441+ 162  158+114  114+251
256     833+ 181  179+119  185+263
512    1546+ 243  220+132  194+292
1024   2886+ 341  299+135  201+312
2048   5737+ 577  517+139  291+370
4096  11288+1201  976+153  528+482

2. All CPUs allocate on objects on CPU N, then freed by CPU N+1 % NR_CPUS
   (ie. CPU1 frees objects allocated by CPU0).
      SLAB        SLQB     SLUB
8       236+ 331   72+123   64+ 114
16      232+ 345   80+125   71+ 139
32      227+ 342   85+134   82+ 183
64      324+ 336  140+138  111+ 219
128     569+ 384  245+201  145+ 337
256    1111+ 448  243+222  238+ 447
512    2091+ 871  249+244  247+ 470
1024   3923+1593  254+256  254+ 503
2048   7700+2968  273+277  369+ 699
4096  15154+5061  310+323  693+1220

SLAB's concurrent allocation bottlnecks show up again in these tests.

Unfortunately these are not too realistic tests of remote freeing pattern,
because normally you would expect remote freeing and allocation happening
concurrently, rather than all allocations up front, then all frees. If
the test behaved like that, then object could probably fit in SLAB's
queues and it might see some good numbers.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html