On Fri, 2 Oct 2015 14:50:44 -0700 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote: > On Fri, 2 Oct 2015 15:40:39 +0200 Jesper Dangaard Brouer <brouer@xxxxxxxxxx> wrote: > > > > Thus, I need introducing new code like this patch and at the same time > > > have to reduce the number of instruction-cache misses/usage. In this > > > case we solve the problem by kmem_cache_free_bulk() not getting called > > > too often. Thus, +17 bytes will hopefully not matter too much... but on > > > the other hand we sort-of know that calling kmem_cache_free_bulk() will > > > cause icache misses. > > > > I just tested this change on top of my net-use-case patchset... and for > > some strange reason the code with this WARN_ON is faster and have much > > less icache-misses (1,278,276 vs 2,719,158 L1-icache-load-misses). > > > > Thus, I think we should keep your fix. > > > > I cannot explain why using WARN_ON() is better and cause less icache > > misses. And I hate when I don't understand every detail. > > > > My theory is, after reading the assembler code, that the UD2 > > instruction (from BUG_ON) cause some kind of icache decoder stall > > (Intel experts???). Now that should not be a problem, as UD2 is > > obviously placed as an unlikely branch and left at the end of the asm > > function call. But the call to __slab_free() is also placed at the end > > of the asm function (gets inlined from slab_free() as unlikely). And > > it is actually fairly likely that bulking is calling __slab_free (slub > > slowpath call). > > Yes, I was looking at the asm code and the difference is pretty small: > a not-taken ud2 versus a not-taken "call warn_slowpath_null", mainly. Actually managed to find documentation that the UD2 instruction do stops the instruction prefetcher. >From "Intel® 64 and IA-32 Architectures Optimization Reference Manual" Section 3.4.1.6 "Branch Type Selection": Cite: "[...] follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path." And from LWN article: https://lwn.net/Articles/255364/ Cite: "[...] like using the ud2 instruction [...]. This instruction, which cannot be executed itself, is used after an indirect jump instruction; it is used as a signal to the instruction fetcher that the processor should not waste efforts decoding the following memory since the execution will continue at a different location." Thus, we were hitting a very unfortunately combination of code, there the UD2 instruction was located before the call to __slab_free() in the ASM code, causing instruction prefetching to fail/stop. Now I understand all the details again :-) My only problem left, is I want a perf measurement that pinpoint these kind of spots. The difference in L1-icache-load-misses were significant (1,278,276 vs 2,719,158). I tried to somehow perf record this with different perf events without being able to pinpoint the location (even though I know the spot now). Even tried Andi's ocperf.py... maybe he will know what event I should try? > But I wouldn't assume that the microbenchmarking is meaningful. I've > seen shockingly large (and quite repeatable) microbenchmarking > differences from small changes in code which isn't even executed (and > this is one such case, actually). You add or remove just one byte of > text and half the kernel (or half the .o file?) gets a different > alignment and this seems to change everything. I do know alignment of code is important, but this performance impact was just too large. And I think I found documentation to back my theory. > Deleting the BUG altogether sounds the best solution. As long as the > kernel crashes in some manner, we'll be able to work out what happened. > And it's cant-happen anyway, isn't it? To me WARN_ON() seems like a good "documentation" if it does not hurt performance. I don't think removing the WARN_ON() will improve performance, but I'm willing to actually test if it matters. Below is how it will crash, it is fairly obvious (if you know x86_64 call convention). API is: void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p) Below it clearly shows second arg "size" in %RSI == 0. [ 2986.977708] BUG: unable to handle kernel paging request at ffffffffa03c4d58 [ 2986.977760] IP: [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0 [ 2986.977805] PGD 19a4067 PUD 19a5063 PMD 408a66067 PTE 3f6407161 [ 2986.977879] Oops: 0003 [#1] SMP [ 2986.977936] Modules linked in: slab_bulk_test03(O+) time_bench(O) netconsole tun coretemp kvm_intel kvm mxm_wmi microcode sg i2c_i801 i2c_core pcspkr wmi video shpchp acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc serio_raw sd_mod hid_generic ixgbe mdio vxlan ip6_udp_tunnel e1000e udp_tunnel ptp pps_core mlx5_core [last unloaded: slab_bulk_test03] [ 2986.978190] CPU: 1 PID: 12495 Comm: modprobe Tainted: G O 4.3.0-rc2-net-next-mm+ #442 [ 2986.978265] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015 [ 2986.978341] task: ffff88040a89d200 ti: ffff8803e3c50000 task.ti: ffff8803e3c50000 [ 2986.978414] RIP: 0010:[<ffffffff8116d044>] [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0 [ 2986.978490] RSP: 0018:ffff8803e3c53aa8 EFLAGS: 00010282 [ 2986.978528] RAX: 0000000000000000 RBX: ffff8803e3c53c30 RCX: 0000000100200016 [ 2986.978571] RDX: ffff8803e3c53ad8 RSI: 0000000000000000 RDI: 000077ff80000000 [ 2986.978614] RBP: ffff8803e3c53ad0 R08: ffff8803e3db3600 R09: 00000000e3db3b01 [ 2986.978655] R10: ffffffffa03c4d58 R11: 0000000000000001 R12: ffffffffffffffff [ 2986.978698] R13: ffff8803e3c53ae0 R14: 000077ff80000000 R15: ffff88040d774c00 [ 2986.978740] FS: 00007fbf31190740(0000) GS:ffff88041fa40000(0000) knlGS:0000000000000000 [ 2986.978814] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2986.978853] CR2: ffffffffa03c4d58 CR3: 00000003f6406000 CR4: 00000000001406e0 [ 2986.978895] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2986.978938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2986.978979] Stack: [ 2986.979014] ffff8803e3c53c30 0000000000000009 0000000000000000 ffffffffa002e000 [ 2986.979099] ffff8803fc40db40 ffff8803e3c53cf0 ffffffffa03c4d58 0000000000000000 [ 2986.979184] 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [ 2986.979259] Call Trace: [ 2986.979296] [<ffffffffa002e000>] ? 0xffffffffa002e000 [ 2986.979335] [<ffffffffa03c4d58>] run_try_crash_tests+0x378/0x3a6 [slab_bulk_test03] [ 2986.979409] [<ffffffffa002e12d>] slab_bulk_test03_module_init+0x12d/0x1000 [slab_bulk_test03] [ 2986.979485] [<ffffffff810002ed>] do_one_initcall+0xad/0x1d0 [ 2986.979526] [<ffffffff8111c117>] do_init_module+0x60/0x1e8 [ 2986.979567] [<ffffffff810c2778>] load_module+0x1bb8/0x2420 [ 2986.979608] [<ffffffff810bf1b0>] ? __symbol_put+0x40/0x40 [ 2986.979647] [<ffffffff810c31b0>] SyS_finit_module+0x80/0xb0 [ 2986.979690] [<ffffffff8166b697>] entry_SYSCALL_64_fastpath+0x12/0x6a [ 2986.979732] Code: f8 eb 05 4d 85 e4 74 13 4c 8b 10 49 83 ec 01 48 89 c2 48 83 e8 08 4d 85 d2 74 e8 4d 85 d2 0f 84 2e 01 00 00 49 63 47 20 4c 89 f7 <49> c7 04 02 00 00 00 00 b8 00 00 00 80 4c 01 d0 48 0f 42 3d b4 [ 2986.979904] RIP [<ffffffff8116d044>] kmem_cache_free_bulk+0x54/0x1b0 [ 2986.979945] RSP <ffff8803e3c53aa8> [ 2986.979982] CR2: ffffffffa03c4d58 [ 2986.980130] ---[ end trace 64f42b02f4347220 ]--- (gdb) list *(kmem_cache_free_bulk)+0x54 0xffffffff8116d044 is in kmem_cache_free_bulk (mm/slub.c:269). 264 return p; 265 } 266 267 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) 268 { 269 *(void **)(object + s->offset) = fp; 270 } 271 272 /* Loop over all objects in a slab */ 273 #define for_each_object(__p, __s, __addr, __objects) \ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href