> On Dec 3, 2018, at 7:12 PM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote: > >> On Dec 3, 2018, at 2:49 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: >> >> On Mon, Dec 03, 2018 at 02:04:41PM -0800, Nadav Amit wrote: >>> On Dec 3, 2018, at 8:13 AM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: >>>> On Mon, Dec 03, 2018 at 02:59:36PM +0100, Vlastimil Babka wrote: >>>>> On 11/28/18 3:01 PM, Matthew Wilcox wrote: >>>>>> Some of the functions in vmalloc.c have as many as nine arguments. >>>>>> So I thought I'd have a quick go at bundling the ones that make sense >>>>>> into a struct and pass around a pointer to that struct. Well, it made >>>>>> the generated code worse, >>>>> >>>>> Worse in which metric? >>>> >>>> More instructions to accomplish the same thing. >>>> >>>>>> so I thought I'd share my attempt so nobody >>>>>> else bothers (or soebody points out that I did something stupid). >>>>> >>>>> I guess in some of the functions the args parameter could be const? >>>>> Might make some difference. >>>>> >>>>> Anyway this shouldn't be a fast path, so even if the generated code is >>>>> e.g. somewhat larger, then it still might make sense to reduce the >>>>> insane parameter lists. >>>> >>>> It might ... I'm not sure it's even easier to program than the original >>>> though. >>> >>> My intuition is that if all the fields of vm_args were initialized together >>> (in the same function), and a 'const struct vm_args *' was provided as >>> an argument to other functions, code would be better (at least better than >>> what you got right now). >>> >>> I’m not saying it is easily applicable in this use-case (since I didn’t >>> check). >> >> Your intuition is wrong ... >> >> text data bss dec hex filename >> 9466 81 32 9579 256b before.o >> 9546 81 32 9659 25bb .build-tiny/mm/vmalloc.o >> 9546 81 32 9659 25bb const.o >> >> indeed, there's no difference between with or without the const, according >> to 'cmp'. >> >> Now, only alloc_vmap_area() gets to take a const argument. >> __get_vm_area_node() intentionally modifies the arguments. But feel >> free to play around with this; you might be able to make it do something >> worthwhile. > > I was playing with it (a bit). What I suggested (modifying > __get_vm_area_node() so it will not change arguments) helps a bit, but not > much. > > One insight that I got is that at least part of the overhead comes from the > the stack protector code that gcc emits. [ +Peter ] So I dug some more (I’m still not done), and found various trivial things (e.g., storing zero extending u32 immediate is shorter for registers, inlining already takes place). *But* there is one thing that may require some attention - patch b59167ac7bafd ("x86/percpu: Fix this_cpu_read()”) set ordering constraints on the VM_ARGS() evaluation. And this patch also imposes, it appears, (unnecessary) constraints on other pieces of code. These constraints are due to the addition of the volatile keyword for this_cpu_read() by the patch. This affects at least 68 functions in my kernel build, some of which are hot (I think), e.g., finish_task_switch(), smp_x86_platform_ipi() and select_idle_sibling(). Peter, perhaps the solution was too big of a hammer? Is it possible instead to create a separate "this_cpu_read_once()” with the volatile keyword? Such a function can be used for native_sched_clock() and other seqlocks, etc.