On Wed, Jan 15, 2020 at 02:14:46PM +0100, Uladzislau Rezki wrote:
> Hello, Joel, Paul.
>
> Thank you for comments and testing!
>
> > >
> > > Nice improvement!
> > >
> > > But rcuperf uses a single block size, which turns into kfree_bulk() using
> > > a single slab, which results in good locality of reference. So I have to
> >
> > You meant a "single cache" category when you say "single slab"? Just to
> > mention, the number of slabs (in a single cache) when a large number of
> > objects are allocated is more than 1 (not single). With current rcuperf, I
> > see 100s of slabs (each slab being one page) in the kmalloc-32 cache. Each
> > slab contains around 128 objects of type kfree_rcu (24 byte object aligned to
> > 32-byte slab object).
> >
> I think that is about using different slab caches to break locality. It
> makes sense, IMHO, because usually the system makes use of different slabs,
> because of different object sizes. On the other hand, I guess there are
> test cases when only one slab gets used.

I was wondering about "locality". A cache can be split into many slabs. Only
the data on a page is local (contiguous). If there are a large number of
objects, then it goes to a new slab (on the same cache). At least on the
kmalloc slabs, there is only 1 slab per page. So for example, if on the
kmalloc-32 slab there are more than 128 objects, then it goes to a different
slab / page. So how is there still locality?

Further, SLAB (not sure about SLUB) doesn't seem to do anything at the moment
to take advantage of locality within a slab.

That said, I am fully supportive of your patch and see the same improvements
as well, which are for the reasons you mentioned in the changelog.

> > > ask... Is this performance result representative of production workloads?
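For reference, the per-slab arithmetic behind the kmalloc-32 figures discussed above can be sketched roughly as follows. This is only an illustration: the 4 KiB page size, the list of kmalloc cache sizes, and the helper names are my assumptions, and it ignores any per-slab metadata the allocator may keep, which is why the thread says "around 128".

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, one page per slab as described above
KMALLOC_SIZES = (8, 16, 32, 64, 96, 128, 192, 256)  # assumption: small kmalloc caches


def kmalloc_cache_size(obj_size):
    """Return the smallest (assumed) kmalloc cache that fits obj_size bytes."""
    for size in KMALLOC_SIZES:
        if obj_size <= size:
            return size
    raise ValueError("object too large for this sketch")


def objects_per_slab(obj_size):
    """Objects that fit in one single-page slab, ignoring slab metadata."""
    return PAGE_SIZE // kmalloc_cache_size(obj_size)


def slabs_needed(num_objects, obj_size):
    """Number of single-page slabs touched by num_objects objects (ceiling division)."""
    per_slab = objects_per_slab(obj_size)
    return -(-num_objects // per_slab)


if __name__ == "__main__":
    # A 24-byte kfree_rcu test object lands in the 32-byte cache.
    print(kmalloc_cache_size(24), objects_per_slab(24), slabs_needed(129, 24))
```

With these assumptions a 24-byte object is rounded up to 32 bytes, 128 of them fit in a page, and the 129th allocation already spills into a second slab.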
> >
> > I added more variation to allocation sizes to rcuperf (patch below) to distribute
> > allocations across 4 kmalloc slabs (32,64,96 and 128) and I see a significant
> > improvement with Ulad's patch in SLAB in terms of completion time of the
> > test. Below are the results. With SLUB I see slightly higher memory
> > footprint, I have never used SLUB and not sure who is using it so I am not
> > too concerned since the degradation in memory footprint is only slight with
> > SLAB having the significant improvement.
> >
> Nice patch! I think it would be useful to have it in the "rcuperf" tool with
> an extra parameter like "different_obj_sizes".

cool, I posted something like this.

> > 2.25.0.rc1.283.g88dfdc4193-goog
> I also have done some tests with your patch on my Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs
> machine to simulate different slab usage:
>
> dev.2020.01.10a branch
>
> # Default, CONFIG_SLAB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [ 83.762963] Total time taken by all kfree'ers: 53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB
> [ 80.108401] Total time taken by all kfree'ers: 53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB
> [ 76.622252] Total time taken by all kfree'ers: 53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB
>
> # With the patch, CONFIG_SLAB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [ 48.265008] Total time taken by all kfree'ers: 23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB
> [ 53.263943] Total time taken by all kfree'ers: 23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB
> [ 50.366440] Total time taken by all kfree'ers: 24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB
>
> # Default, CONFIG_SLUB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [ 81.818576] Total time taken by all kfree'ers: 51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB
> [ 77.854866] Total time taken by all kfree'ers: 51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB
> [ 76.329577] Total time taken by all kfree'ers: 51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB
>
> # With the patch, CONFIG_SLUB, kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1, 16, 32, 64, 96 obj sizes
> [ 76.254485] Total time taken by all kfree'ers: 50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB
> [ 75.891521] Total time taken by all kfree'ers: 50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB
> [ 76.172573] Total time taken by all kfree'ers: 50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB
>
> In the case of CONFIG_SLAB there is a twofold increase in performance but slightly higher memory usage.
> As for CONFIG_SLUB, I still see higher performance figures + lower memory usage with the patch.

Ok, testing today, our results are quite similar.
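To put rough numbers on the runs quoted above, here is a small sketch that averages the three runs per configuration and compares default against patched. The timings and footprints are copied from the rcuperf output in this mail; the dictionary layout and function names are only mine, for illustration.

```python
from statistics import mean

# (time in ns, memory footprint in MB) per run, copied from the rcuperf
# output quoted above. The layout and helper names are illustrative only.
RUNS = {
    ("SLAB", "default"): [(53607352517, 1248), (53529637912, 1193), (53570175705, 1250)],
    ("SLAB", "patched"): [(23981587315, 1219), (23879375281, 1190), (24086841707, 1380)],
    ("SLUB", "default"): [(51291025022, 741), (51278911477, 719), (51256183045, 647)],
    ("SLUB", "patched"): [(50709919132, 456), (50736297452, 507), (50660403893, 429)],
}


def mean_time_ns(allocator, kernel):
    """Mean total kfree'er time over the runs of one configuration."""
    return mean(ns for ns, _ in RUNS[(allocator, kernel)])


def mean_footprint_mb(allocator, kernel):
    """Mean memory footprint over the runs of one configuration."""
    return mean(mb for _, mb in RUNS[(allocator, kernel)])


def speedup(allocator):
    """Default time divided by patched time; above 1.0 means the patch is faster."""
    return mean_time_ns(allocator, "default") / mean_time_ns(allocator, "patched")


def footprint_change(allocator):
    """Relative change in mean footprint with the patch (negative means less memory)."""
    return mean_footprint_mb(allocator, "patched") / mean_footprint_mb(allocator, "default") - 1.0


if __name__ == "__main__":
    for allocator in ("SLAB", "SLUB"):
        print(f"{allocator}: speedup {speedup(allocator):.2f}x, "
              f"footprint change {footprint_change(allocator):+.1%}")
```

On these numbers the SLAB runs come out a bit over 2x faster with the patch at a few percent more memory, while the SLUB runs are about 1% faster with the footprint dropping by roughly a third. Three runs per configuration is a small sample, so treat the percentages as indicative only.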
>
> Apart from that, I have got the report from the "kernel test robot":
> <snip>
> [ 13.957168] ------------[ cut here ]------------
> [ 13.958256] ODEBUG: free active (active state 1) object type: rcu_head hint: 0x0
> [ 13.962148] WARNING: CPU: 0 PID: 212 at lib/debugobjects.c:484 debug_print_object+0x95/0xd0
> [ 13.964298] Modules linked in:
> [ 13.964960] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 5.5.0-rc1-00136-g883a2cefc0684 #1
> [ 13.966712] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 13.968528] Workqueue: events kfree_rcu_work
> [ 13.969466] RIP: 0010:debug_print_object+0x95/0xd0
> [ 13.970480] Code: d2 e8 2f 06 d6 ff 8b 43 10 4d 89 f1 4c 89 e6 8b 4b 14 48 c7 c7 88 73 be 82 4d 8b 45 00 48 8b 14 c5 a0 5f 6d 82 e8 7b 65 c6 ff <0f> 0b b9 01 00 00 00 31 d2 be 01 00 00 00 48 c7 c7 98 b8 0c 83 e8
> [ 13.974435] RSP: 0000:ffff888231677bf8 EFLAGS: 00010282
> [ 13.975531] RAX: 0000000000000000 RBX: ffff88822d4200e0 RCX: 0000000000000000
> [ 13.976730] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8306e028
> [ 13.977568] RBP: ffff888231677c18 R08: 0000000000000000 R09: ffff888231670790
> [ 13.978412] R10: ffff888231670000 R11: 0000000000000003 R12: ffffffff82bc5299
> [ 13.979250] R13: ffffffff82e77360 R14: 0000000000000000 R15: dead000000000100
> [ 13.980089] FS: 0000000000000000(0000) GS:ffffffff82e4f000(0000) knlGS:0000000000000000
> [ 13.981069] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 13.981746] CR2: 00007f1e913fc77c CR3: 0000000225ce9000 CR4: 00000000000006f0
> [ 13.982587] Call Trace:
> [ 13.982911] __debug_check_no_obj_freed+0x19a/0x200
> [ 13.983494] debug_check_no_obj_freed+0x14/0x20
> [ 13.984036] free_pcp_prepare+0xee/0x1d0
> [ 13.984541] free_unref_page+0x1b/0x80
> [ 13.984994] __free_pages+0x19/0x20
> [ 13.985503] __free_pages+0x13/0x20
> [ 13.985924] slob_free_pages+0x7d/0x90
> [ 13.986373] slob_free+0x34f/0x530
> [ 13.986784] kfree+0x154/0x210
> [ 13.987155] __kmem_cache_free_bulk+0x44/0x60
> [ 13.987673] kmem_cache_free_bulk+0xe/0x10
> [ 13.988163] kfree_rcu_work+0x95/0x310
> [ 13.989010] ? kfree_rcu_work+0x64/0x310
> [ 13.989884] process_one_work+0x378/0x7c0
> [ 13.990770] worker_thread+0x40/0x600
> [ 13.991587] kthread+0x14e/0x170
> [ 13.992344] ? process_one_work+0x7c0/0x7c0
> [ 13.993256] ? kthread_create_on_node+0x70/0x70
> [ 13.994246] ret_from_fork+0x3a/0x50
> [ 13.995039] ---[ end trace cdf242638b0e32a0 ]---
> [child0:632] trace_fd was -1
> <snip>
>
> The trace happens when the kernel is built with CONFIG_DEBUG_OBJECTS_FREE
> and CONFIG_DEBUG_OBJECTS_RCU_HEAD. Basically it is not a problem with the patch
> itself, nor is there any bug there. It just does not pair with debug_rcu_head_queue(head)
> in the kfree_rcu_work() function, which is why the kernel thinks it is freeing
> an active object that is not active in reality.
>
> I will upload a V2 to fix that.

Oh good point. Thanks for fixing that.

thanks,

 - Joel