On 3/4/23, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On 3/3/23, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>>>
>>> I think there is a systemic problem which comes with the kzalloc API
>>
>> Well, it's not necessarily the API that is bad, but the implementation.
>>
>> We could easily make kzalloc() with a constant size just expand to
>> kmalloc+memset, and get the behavior you want.
>>
>> We already do magical things for the "find the right slab bucket" part
>> of kmalloc too for constant sizes. It's changed over the years, but that
>> policy goes back a long long time. See
>>
>>   https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>>
>> from the BK history tree.
>>
>> Exactly because some things are worth optimizing for when the size is
>> known at compile time.
>>
>> Maybe just extend kzalloc() similarly? Trivial and entirely untested
>> patch:
>>
>> --- a/include/linux/slab.h
>> +++ b/include/linux/slab.h
>> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
>>   */
>>  static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>>  {
>> +	if (__builtin_constant_p(size)) {
>> +		void *ret = kmalloc(size, flags);
>> +		if (ret)
>> +			memset(ret, 0, size);
>> +		return ret;
>> +	}
>>  	return kmalloc(size, flags | __GFP_ZERO);
>>  }
>>
>
> So I played with this and have a rather nasty summary. Bullet points:
> 1. the patched kzalloc does not reduce memset calls during a kernel build
> 2. a patched kmem_cache_zalloc_ptr with 2 consumers converted *does* drop
>    them significantly (36150671 -> 14414454)
> 3. ... but the inline memset generated by gcc is poor: it resorts to
>    rep stosq at around 48 bytes
> 4. the memsets which remain have sizes not known at compilation time and
>    are not necessarily perf bugs on their own [read: they would benefit
>    from a faster memset]
>
> Onto the meat:
>
> I patched the kernel with a slightly tweaked version of the above:
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..7abb5490690f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
>   */
>  static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>  {
> +	if (__builtin_constant_p(size)) {
> +		void *ret = kmalloc(size, flags);
> +		if (likely(ret))
> +			memset(ret, 0, size);
> +		return ret;
> +	}
>  	return kmalloc(size, flags | __GFP_ZERO);
>  }
>
> and verified it indeed zeroes inline:
>
> void kztest(void)
> {
> 	void *ptr;
>
> 	ptr = kzalloc(32, GFP_KERNEL);
> 	if (unlikely(!ptr))
> 		return;
> 	memsettest_rec(ptr);
> }
>
> $ objdump --disassemble=kztest vmlinux
> [snip]
> 	call   ffffffff8135e130 <kmalloc_trace>
> 	test   %rax,%rax
> 	je     ffffffff81447d5f <kztest+0x4f>
> 	movq   $0x0,(%rax)
> 	mov    %rax,%rdi
> 	movq   $0x0,0x8(%rax)
> 	movq   $0x0,0x10(%rax)
> 	movq   $0x0,0x18(%rax)
> 	call   ffffffff81454060 <memsettest_rec>
> [snip]
>
> This did *NOT* lead to a reduction in memset calls when building the
> kernel.
>
> I verified a few cases by hand; they are all either kmem_cache_zalloc or
> explicitly added memsets with sizes not known at compilation time.
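For context on the next step: kmem_cache_zalloc itself cannot get the same
constant-size treatment as kzalloc above, because the object size is a
runtime property of the cache rather than of the call site. The existing
helper is simply:

static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
{
	/* the size lives in *k, so the zeroing has to stay behind __GFP_ZERO */
	return kmem_cache_alloc(k, flags | __GFP_ZERO);
}

which is why the kmem_cache_zalloc_ptr experiment further down hands the
typed pointer to the allocation site instead.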
>
> The two most frequent callers:
>
> @[
>     memset+5
>     __alloc_file+40
>     alloc_empty_file+73
>     path_openat+77
>     do_filp_open+182
>     do_sys_openat2+159
>     __x64_sys_openat+89
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
> @[
>     memset+5
>     security_file_alloc+45
>     __alloc_file+89
>     alloc_empty_file+73
>     path_openat+77
>     do_filp_open+182
>     do_sys_openat2+159
>     __x64_sys_openat+89
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
>
> My wip addition is:
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..12b5b02ef3d3 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
>  	return kmem_cache_alloc(k, flags | __GFP_ZERO);
>  }
>
> +#define kmem_cache_zalloc_ptr(k, f, retp) ({			\
> +	__typeof(retp) _retp = kmem_cache_alloc(k, f);		\
> +	bool _rv = false;					\
> +	retp = _retp;						\
> +	if (likely(_retp)) {					\
> +		memset(_retp, 0, sizeof(*_retp));		\
> +		_rv = true;					\
> +	}							\
> +	_rv;							\
> +})
> +
> diff --git a/security/security.c b/security/security.c
> index cf6cc576736f..0f769ede0e54 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
>  		return 0;
>  	}
>
> -	file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
> -	if (file->f_security == NULL)
> +	if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL, file->f_security))
>  		return -ENOMEM;
>  	return 0;
>  }

This one is actually buggy -- f_security is a void * pointer, so
sizeof(*file->f_security) evaluates to just 1. The macro has no safety
belts against that: it should probably reject void * at compilation time
and BUG_ON a runtime size mismatch (a variant with such checks is sketched
below, after the rest of the quoted mail). This does not affect the idea
though.

Good news: gcc provides a lot of control over how it inlines string ops,
most notably:

-mstringop-strategy=alg
	Override the internal decision heuristic for the particular
	algorithm to use for inlining string operations. The allowed
	values for alg are:

	rep_byte
	rep_4byte
	rep_8byte
		Expand using the i386 "rep" prefix of the specified size.

	byte_loop
	loop
	unrolled_loop
		Expand into an inline loop.

	libcall
		Always use a library call.

I'm going to play with it and send something more presentable.

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 372653b92617..8e0dabf9530e 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred)
>  	struct file *f;
>  	int error;
>
> -	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
> -	if (unlikely(!f))
> +	if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
>  		return ERR_PTR(-ENOMEM);
>
>  	f->f_cred = get_cred(cred);
>
> As mentioned above, this cuts the total number of calls by more than half.
>
> The problem is that it rolls with rep stosq way too easily, partially
> defeating the point of inlining anything. clang does not have this
> problem.
>
> Take a look at __alloc_file:
> [snip]
> 	mov    0x19cab05(%rip),%rdi        # ffffffff82df4318 <filp_cachep>
> 	call   ffffffff813dd610 <kmem_cache_alloc>
> 	test   %rax,%rax
> 	je     ffffffff814298b7 <__alloc_file+0xc7>
> 	mov    %rax,%r12
> 	mov    $0x1d,%ecx
> 	xor    %eax,%eax
> 	mov    %r12,%rdi
> 	rep stos %rax,%es:(%rdi)
> [/snip]
>
> Here is a sample consumer which can't help but have a variable size --
> select, as used by gmake:
>
> @[
>     memset+5
>     do_pselect.constprop.0+202
>     __x64_sys_pselect6+101
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 13160
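As for the safety belts mentioned above, something along these lines would
probably do. This is an entirely untested sketch on my side: the type check
uses __same_type to reject void * consumers outright, and the runtime check
compares against kmem_cache_size(); whether that wants to be a BUG_ON on
every allocation or a one-time WARN is debatable.

#define kmem_cache_zalloc_ptr(k, f, retp) ({				\
	__typeof(retp) _retp;						\
	bool _rv = false;						\
									\
	/* a void * consumer would silently memset just 1 byte */	\
	BUILD_BUG_ON(__same_type(retp, void *));			\
	_retp = kmem_cache_alloc(k, f);					\
	retp = _retp;							\
	if (likely(_retp)) {						\
		/* the object size of the cache has to match the type */ \
		BUG_ON(kmem_cache_size(k) != sizeof(*_retp));		\
		memset(_retp, 0, sizeof(*_retp));			\
		_rv = true;						\
	}								\
	_rv;								\
})

Note the type check would rule out the lsm_file_cache conversion as it
stands, since the blob size there is not tied to any C type.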
> In conclusion:
> 1. fixing up memset itself is worthwhile regardless of what happens to
>    its current consumers -- not all of them are necessarily doing
>    something wrong
> 2. inlining memset can be a pessimization vs the non-plain-erms memset,
>    as evidenced above; I will have to figure out how to convince gcc to
>    be less eager to emit rep stosq
>
> Sometimes I hate computers.
>
> --
> Mateusz Guzik <mjguzik gmail.com>

-- 
Mateusz Guzik <mjguzik gmail.com>
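PS: for playing with the -mstringop-strategy knob quoted earlier, a
throwaway userspace file like the one below is enough to eyeball the
codegen. This is my own toy, not something that was measured; the 232-byte
size merely mimics the 0x1d quadwords zeroed in the __alloc_file
disassembly above.

/* memset_strategy_test.c */
#include <string.h>

struct blob {
	char data[232];	/* 0x1d * 8 bytes, the size zeroed in __alloc_file */
};

void blob_clear(struct blob *b)
{
	/* constant size, so gcc is free to expand this inline */
	memset(b, 0, sizeof(*b));
}

Compare for example:
$ gcc -O2 -c memset_strategy_test.c
$ gcc -O2 -mstringop-strategy=unrolled_loop -c memset_strategy_test.c
$ objdump -d memset_strategy_test.o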