On 3/3/23, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: >> >> I think there is a systemic problem which comes with the kzalloc API > > Well, it's not necessarily the API that is bad, but the implementation. > > We could easily make kzalloc() with a constant size just expand to > kmalloc+memset, and get the behavior you want. > > We already do magical things for "find the right slab bucket" part of > kmalloc too for constant sizes. It's changed over the years, but that > policy goes back a long long time. See > > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8 > > from the BK history tree. > > Exactly because some things are worth optimizing for when the size is > known at compile time. > > Maybe just extending kzalloc() similarly? Trivial and entirely untested > patch: > > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct > kmem_cache *k, gfp_t flags) > */ > static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) > { > + if (__builtin_constant_p(size)) { > + void *ret = kmalloc(size, flags); > + if (ret) > + memset(ret, 0, size); > + return ret; > + } > return kmalloc(size, flags | __GFP_ZERO); > } > So I played with this and have a rather nasty summary. Bullet points: 1. patched kzalloc does not reduce memsets calls during kernel build 2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop it significantly (36150671 -> 14414454) 3. ... inline memset generated by gcc sucks by resorting to rep stosq around 48 bytes 4. memsets not sorted out have sizes not known at compilation time and are not necessarily perf bugs on their own [read: would benefit from faster memset] Onto the the meat: I patched the kernel with a slightly tweaked version of the above: diff --git a/include/linux/slab.h b/include/linux/slab.h index 45af70315a94..7abb5490690f 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags) */ static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags) { + if (__builtin_constant_p(size)) { + void *ret = kmalloc(size, flags); + if (likely(ret)) + memset(ret, 0, size); + return ret; + } return kmalloc(size, flags | __GFP_ZERO); } and verified it indeed zeroes inline: void kztest(void) { void *ptr; ptr = kzalloc(32, GFP_KERNEL); if (unlikely(!ptr)) return; memsettest_rec(ptr); } $ objdump --disassemble=kztest vmlinux [snip] call ffffffff8135e130 <kmalloc_trace> test %rax,%rax je ffffffff81447d5f <kztest+0x4f> movq $0x0,(%rax) mov %rax,%rdi movq $0x0,0x8(%rax) movq $0x0,0x10(%rax) movq $0x0,0x18(%rax) call ffffffff81454060 <memsettest_rec> [snip] This did *NOT* lead to reduction of memset calls when building the kernel. I verified few cases by hand, it is all either kmem_cache_zalloc or explicitly added memsets with sizes not known at compilation time. Two most frequent callers: @[ memset+5 __alloc_file+40 alloc_empty_file+73 path_openat+77 do_filp_open+182 do_sys_openat2+159 __x64_sys_openat+89 do_syscall_64+93 entry_SYSCALL_64_after_hwframe+114 ]: 11028994 @[ memset+5 security_file_alloc+45 __alloc_file+89 alloc_empty_file+73 path_openat+77 do_filp_open+182 do_sys_openat2+159 __x64_sys_openat+89 do_syscall_64+93 entry_SYSCALL_64_after_hwframe+114 ]: 11028994 My wip addition is: diff --git a/include/linux/slab.h b/include/linux/slab.h index 45af70315a94..12b5b02ef3d3 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags) return kmem_cache_alloc(k, flags | __GFP_ZERO); } +#define kmem_cache_zalloc_ptr(k, f, retp) ({ \ + __typeof(retp) _retp = kmem_cache_alloc(k, f); \ + bool _rv = false; \ + retp = _retp; \ + if (likely(_retp)) { \ + memset(_retp, 0, sizeof(*_retp)); \ + _rv = true; \ + } \ + _rv; \ +}) + diff --git a/security/security.c b/security/security.c index cf6cc576736f..0f769ede0e54 100644 --- a/security/security.c +++ b/security/security.c @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file) return 0; } - file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL); - if (file->f_security == NULL) + if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL, file->f_security)) return -ENOMEM; return 0; } diff --git a/fs/file_table.c b/fs/file_table.c index 372653b92617..8e0dabf9530e 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred) struct file *f; int error; - f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL); - if (unlikely(!f)) + if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f)) return ERR_PTR(-ENOMEM); f->f_cred = get_cred(cred); As mentioned above it cuts total calls in more than half. The problem is it is it rolls with rep stosq way too easily, partially defeating the point of inlining anything. clang does not have this problem. Take a look at __alloc_file: [snip] mov 0x19cab05(%rip),%rdi # ffffffff82df4318 <filp_cachep> call ffffffff813dd610 <kmem_cache_alloc> test %rax,%rax je ffffffff814298b7 <__alloc_file+0xc7> mov %rax,%r12 mov $0x1d,%ecx xor %eax,%eax mov %r12,%rdi rep stos %rax,%es:(%rdi) [/snip] Here is a sample consumer which can't help but have a variable size -- select, used by gmake: @[ memset+5 do_pselect.constprop.0+202 __x64_sys_pselect6+101 do_syscall_64+93 entry_SYSCALL_64_after_hwframe+114 ]: 13160 In conclusion: 1. fixing up memsets is worthwhile regardless of what happens to its current consumers -- not all of them are necessarily doing something wrong 2. inlining memset can be a pessimization vs non-plain-erms memset as evidenced above. will have to figure out how to convince gcc to be less eager to use it. Sometimes I hate computers. -- Mateusz Guzik <mjguzik gmail.com>