On Mon, Nov 20, 2023 at 1:32 AM Masahiro Yamada <masahiroy@xxxxxxxxxx> wrote:
>
> On Wed, Nov 15, 2023 at 9:49 AM Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
>
> This is a standalone script, which is unrelated to kbuild.
> Please change the subject to
> "scripts: Add inline-account tool"
>
>
> > A common cause of binary code bloat is excessive inlining. Traditional
> > tools (like nm --size-sort -t d) don't address that directly because
> > they only see the final functions, but don't know about inlines.
> >
> > This patch adds inline-account that makes it easy to track that down
> > by accounting code bytes to all functions visible in the debug information,
> > as well as code lines.
> >
> > Here are some examples:
> >
> > Show all inlines that increase code size by >1K in the core scheduler:
> >
> > $ inline-account.py --min-bytes=1000 kernel/sched/core.o
> > Total code bytes seen 75690
> >
> > Code bytes by functions:
> > Function Total Avg Num
> > rq_pin_lock 1401 (0.02%) 35 39
> > __sched_setscheduler 1277 (0.02%) 41 31
> > perf_fetch_caller_regs 1012 (0.01%) 17 58
> >
> > Code bytes by nearby source line blocks:
> > prefix /home/ak/lsrc/git/linux/
> > Line Total
> > kernel/sched/sched.h:1610 1387 (0.02%)
> > include/trace/events/sched.h:16 1172 (0.02%)
> > include/trace/events/sched.h:222 1058 (0.01%)
> >
> > This indicates that rq_pin_lock should likely be not inline,
> > and perhaps perf_fetch_caller_regs not either.
> >
> > Note that not all large inlines are necessary bloat. If there is only
>
> necessarily bloat?
>
>
> > a single call site it isn't bloat (the tool currently cannot distinguish
> > that case). For example it is commonly seen with syscall definitions
> > that use single large inlines with only a single caller. In the example
> > above I think it's the case with __sched_setscheduler.
> >
> > Show the >1K inlines in lib/maple_tree.o, which for some reason
> > comes in at a incredible 73k of code size:
> >
> > $ inline-account.py --min-bytes 1000 lib/maple_tree.o
> > Total code bytes seen 73578
> >
> > Code bytes by functions:
> > Function Total Avg Num
> > mas_mab_cp 5537 (0.08%) 37 149
> > mas_pop_node 3798 (0.05%) 28 131
> > ma_slots 2368 (0.03%) 14 162
> > ma_pivots 2353 (0.03%) 10 222
> > mas_destroy_rebalance 2056 (0.03%) 42 48
> > mas_start 1661 (0.02%) 13 125
> > mas_set_parent 1454 (0.02%) 20 72
> > mas_set_alloc_req 1410 (0.02%) 17 80
> > mte_node_type 1360 (0.02%) 5 228
> > mas_data_end 1189 (0.02%) 16 74
> > mte_to_node 1085 (0.01%) 3 276
> > mas_split 1053 (0.01%) 65 16
> > mas_topiary_replace 1033 (0.01%) 38 27
> > mas_root_expand 1001 (0.01%) 35 28
> >
> > Code bytes by nearby source line blocks:
> > prefix /home/ak/lsrc/git/linux/
> > Line Total
> > lib/maple_tree.c:210 1360 (0.02%)
> > include/trace/events/maple_tree.h:80 1283 (0.02%)
> > lib/maple_tree.c:649 1193 (0.02%)
> > lib/maple_tree.c:288 1097 (0.01%)
> >
> > It's clear there is a lot of potential for shrinking here, as a quick
> > experiment shows:
> >
> > $ size lib/maple_tree.o
> >    text    data     bss     dec     hex filename
> >   72257    5312       8   77577   12f09 lib/maple_tree.o
> > $ sed -i -e s/__always_inline// -e 's/ inline/ /' lib/maple_tree.c
> > $ make -s lib/maple_tree.o
> > $ size lib/maple_tree.o
> >    text    data     bss     dec     hex filename
> >   47774    4720       8   52502    cd16 lib/maple_tree.o
> >
> > 34% reduction just from trusting the compiler. Most of it seems
> > to come from abuse of __always_inline.
> > I suppose a large scale tree purge of that would give some decent
> > binary size results.
> >
> > $ inline-account.py --show=5 kernel/workqueue.o
> > Total code bytes seen 40403
> >
> > Code bytes by functions:
> > Function Total Avg Num
> > bitmap_copy 1477 (0.04%) 26 56
> > show_pwq 912 (0.02%) 76 12
> > workqueue_init_early 846 (0.02%) 29 29
> > __flush_workqueue 753 (0.02%) 31 24
> > alloc_and_link_pwqs 558 (0.01%) 69 8
> >
> > Code bytes by nearby source line blocks:
> > prefix /home/ak/lsrc/git/linux/
> > Line Total
> > include/linux/bitmap.h:268 1336 (0.03%)
> > include/trace/events/workqueue.h:23 1038 (0.03%)
> > include/trace/events/workqueue.h:108 732 (0.02%)
> > include/trace/events/workqueue.h:59 694 (0.02%)
> > include/trace/events/workqueue.h:82 670 (0.02%)
> > $
> >
> > This is an interesting case because bitmap_copy is just
> >
> > static inline void bitmap_copy(unsigned long *dst, const unsigned long *src,
> >                                unsigned int nbits)
> > {
> >         unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
> >
> >         if (small_const_nbits(nbits))
> >                 *dst = *src;
> >         else
> >                 memcpy(dst, src, len);
> > }
> >
> > memcpy (which is a macro) must sometimes generate a lot of
> > code. The small_const_nbits case definitely should be inlined though
> > because it's likely even smaller than a call. Would need
> > more investigation.
> >
> > The other large inlines are trace points. Perhaps there is something
> > there that could be done to shrink that a bit.
> >
> > Finally we can do a global accounting (currently with multiple runs):
> >
> > (ignore the percentage numbers since they are just for the local file)
> >
> > $ find -name '*.o' | xargs -n1 inline-account.py > a
> > $ sort -n -r -k 2 a | head -30
> > ZSTD_count 81799 (0.19%) 32 2514
> > ZSTD_count 52233 (0.25%) 33 1544
> > kmalloc 43324 (0.00%) 12 3334
> > pv_queued_spin_unlock 42027 (0.00%) 9 4580
> > constant_test_bit 41667 (0.00%) 5 8005
> > arch/x86/include/asm/paravirt.h:591 41044 (0.00%)
> > arch/x86/include/asm/bitops.h:207 40153 (0.00%)
> > __refcount_add 37968 (0.00%) 24 1532
> > page_fixed_fake_head 36368 (0.00%) 19 1832
> > include/linux/slab.h:599 35654 (0.00%)
> > arch/x86/include/asm/jump_label.h:27 35156 (0.00%)
> > spin_lock 32170 (0.00%) 10 3007
> > __refcount_sub_and_test 32068 (0.00%) 17 1842
> > include/linux/spinlock.h:351 31102 (0.00%)
> > arch_static_branch 30874 (0.00%) 4 7022
> > get_current 30714 (0.00%) 9 3351
> > arch/x86/include/asm/current.h:41 29912 (0.00%)
> > trace_trigger_soft_disabled 29814 (0.00%) 21 1368
> > perf_fetch_caller_regs 27504 (0.00%) 16 1634
> > ZSTD_storeSeq 26060 (0.06%) 30 862
> > hid_map_usage 25582 (0.00%) 88 288
> > ZSTD_compressBlock_lazy_generic 24953 (0.12%) 46 535
> > ZSTD_compressBlock_lazy_generic 24953 (0.06%) 46 535
> > paravirt_ret0 24152 (0.00%) 24152 1
> > spin_unlock_irqrestore 23253 (0.00%) 10 2281
> > include/linux/spinlock.h:406 22526 (0.00%)
> > ZSTD_RowFindBestMatch 21527 (0.10%) 23 922
> > ZSTD_RowFindBestMatch 21527 (0.05%) 23 922
> > __list_add 21209 (0.00%) 11 1851
> > include/linux/refcount.h:283 20642 (0.00%)
> >
> > - So my kernel is spending around ~30K just for getting task_structs in
> > current.
> > - I'm sure ZSTD is great, but is it >200K in duplicated code worth great?
> > - page_fixed_fake_head probably shouldn't be inlined
> > - There might be some potential in out lining reference counts
> > (although that one might be truly performance critical)
> > - There's maybe some potential in shrinking trace point bloat?
> >
> > ... and more similar insights.
> >
> > Also of course there can be critical inlines that really need
> > to be inline in many sites for best performance. But that's rarely the case
> > if they are big because it's unlikely the small call overhead is making
> > a significant difference for a large chunk of code.
> >
> > In any case the tool is useful, so I think it deserves its place
> > in scripts/
> >
> > Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> >
> > ---
> >
> > v2: Address review comments. Change balancing to be address based.
> > Add option to set objdump binary.
> > ---
> >  scripts/inline-account.py | 201 ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 201 insertions(+)
> >  create mode 100755 scripts/inline-account.py
> >
> > diff --git a/scripts/inline-account.py b/scripts/inline-account.py
> > new file mode 100755
> > index 000000000000..c32cb0547172
> > --- /dev/null
> > +++ b/scripts/inline-account.py
> > @@ -0,0 +1,201 @@
> > +#!/usr/bin/env python3
> > +# account code bytes per source code / functions from objdump -Sl output
> > +# useful to find inline bloat
> > +#
> > +# SPDX-License-Identifier: GPL-2.0-only
>
> Please move this above.
>
> The first line is the shebang.
> The second line is SPDX.
>
>
> > +# Author: Andi Kleen
> > +
> > +import os
> > +import sys
> > +import re
> > +import argparse
> > +import bisect
> > +import multiprocessing
>
> Please sort imports alphabetically.
>
>
> > +from collections import Counter
> > +from functools import reduce, partial
> > +
> > +def get_args():
> > +    p = argparse.ArgumentParser(
> > +        description="""
> > +Account code bytes per source code / functions from objdump.
> > +Useful to find inline bloat.
> > +
> > +The line numbers are the beginning of a block, so the actual code can be later.
> > +Line numbers can be a also little off due to objdump bugs
> > +also some misaccounting can happen due to inexact gcc debug information.
> > +The number output for functions may account a single large function multiple
> > +times. program/object files need to be built with -g.
> > +
> > +This is somewhat slow due to objdump -S being slow. It helps to have
> > +plenty of cores.""")
> > +    p.add_argument('--min-bytes', type=int, help='minimum bytes to report', default=100)
> > +    p.add_argument('--threads', '-t', type=int, default=multiprocessing.cpu_count(),
>
> Please check that args.threads >= 1.
>
> If not, please show an error message.
>
>
> > +                   help='Number of objdump processes to run')
> > +    p.add_argument('--verbose', '-v', action='store_true', help="Print more")
> > +    p.add_argument('--show', type=int, help='Number of results to show')
> > +    p.add_argument('--objdump', default="objdump", help="Set objdump binary to run")
> > +    p.add_argument('file', help='object file/program as input')
> > +    return p.parse_args()
> > +
> > +def get_syms(fn):
> > +    syms = []
> > +    pc = None
> > +    with os.popen("nm -n --print-size " + fn) as f:
> > +        for l in f:
> > +            n = l.split()
> > +            if len(n) > 2 and n[2].upper() == "T":
> > +                pc = int(n[0], 16)
> > +                syms.append(pc)
> > +                ln = int(n[1], 16)
> > +    if not pc:
> > +        sys.exit(fn + " has no symbols")
>
> "if not pc" becomes true in two cases:
> - pc is zero
> - pc is None
>
> Only the latter case is "no symbols" because
> a symbol address may become zero in relocatable ELF.
>
> if len(syms) == 0:
>     ...
>
> will work correctly.
>
> And, "pc = None" is unneeded.
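
Just to illustrate the suggestion, an untested sketch of get_syms() with the
length check and without "pc = None" (nothing new here, only the code already
in the patch rearranged as suggested above):

def get_syms(fn):
    syms = []
    with os.popen("nm -n --print-size " + fn) as f:
        for l in f:
            n = l.split()
            if len(n) > 2 and n[2].upper() == "T":
                pc = int(n[0], 16)
                syms.append(pc)
                ln = int(n[1], 16)
    # an empty list is the only real "no symbols" case;
    # a symbol at address 0 is fine in relocatable ELF
    if len(syms) == 0:
        sys.exit(fn + " has no symbols")
    syms.append(pc + ln)
    return syms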

> > +        syms.append(pc + ln)
> > +    return syms
> > +
> > +class Account:
> > +    def __init__(self):
> > +        self.funcbytes = Counter()
> > +        self.linebytes = Counter()
> > +        self.funccount = Counter()
> > +        self.nolinebytes = 0
> > +        self.nofuncbytes = 0
> > +        self.total = 0
> > +
> > +    def add(self, b):
> > +        self.funcbytes += b.funcbytes
> > +        self.linebytes += b.linebytes
> > +        self.funccount += b.funccount
> > +        self.nolinebytes += b.nolinebytes
> > +        self.nofuncbytes += b.nofuncbytes
> > +        self.total += b.total
> > +        return self
> > +
>
> Since this script implements a class,
> please follow this blank line rule.
>
> https://peps.python.org/pep-0008/#blank-lines
>
>
> > +# dont add sys.exit here, causes deadlocks
> > +def account_range(args, r):
> > +    a = Account()
> > +    line = None
> > +    func = None
> > +    codefunc = None
> > +
> > +    cmd = ("%s -Sl %s --start-address=%#x --stop-address=%#x" %
> > +           (args.objdump, args.file, r[0], r[1]))
> > +    if args.verbose:
> > +        print(cmd)
> > +    with os.popen(cmd) as f:
> > +        for l in f:
> > +            # 250: e8 00 00 00 00  callq 255 <proc_skip_spaces+0x5>
> > +            m = re.match(r'\s*([0-9a-fA-F]+):\s+(.*)', l)
> > +            if m:
> > +                bytes = len(re.findall(r'[0-9a-f][0-9a-f] ', m.group(2)))
> > +                if not func:
> > +                    a.nofuncbytes += bytes
> > +                    continue
> > +                if not line:
> > +                    a.nolinebytes += bytes
> > +                    continue
> > +                a.total += bytes
> > +                a.funcbytes[func] += bytes
> > +                a.linebytes[(file, line)] += bytes
> > +                codefunc = func
> > +                continue
> > +
> > +            # sysctl_init():
> > +            m = re.match(r'([a-zA-Z_][a-zA-Z0-9_]*)\(\):$', l)
> > +            if m:
> > +                if codefunc and m.group(1) != codefunc:
> > +                    a.funccount[codefunc] += 1
> > +                    codefunc = None
> > +                func = m.group(1)
> > +                continue
> > +
> > +            # /sysctl.c:1666
> > +            m = re.match(r'^([^:]+):(\d+)$', l)
> > +            if m:
> > +                file, line = m.group(1), int(m.group(2))
> > +                continue
> > +
> > +    if codefunc:
> > +        a.funccount[codefunc] += 1
> > +    return a
> > +
> > +def get_boundaries(syms, sym_sizes, chunk):
> > +    run = 0
> > +    boundaries = [syms[0]]
> > +    for i, x in enumerate(sym_sizes):
> > +        run += x
> > +        if run >= chunk:
> > +            boundaries.append(syms[i])
> > +            run = 0
> > +    boundaries.append(syms[-1])
> > +    return boundaries
> > +
> > +
> > +def process(args):
> > +    # objdump -S is slow, so we parallelize
> > +
> > +    # split symbol table into chunks for parallelization
> > +    # we split on functions boundaries to avoid mis-accounting
> > +    syms = get_syms(args.file)
> > +    if len(syms) < 2:
> > +        print("not enough symbols")
> > +        return
>
> This message should go to stderr instead of stdout.
>
> sys.exit("not enough symbols")
>
> Or, you can remove this check entirely if you check the length in get_syms().

get_syms() requires at least one element from the loop, and append() one
more at the end. So, get_syms() always has two or more elements.

--
Best Regards
Masahiro Yamada