On Thu, Sep 14, 2023 at 10:14 AM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote:
>
> On 14/09/2023 14:05, pengdonglin wrote:
> > On 2023/9/14 20:46, Alan Maguire wrote:
> >> On 14/09/2023 11:13, pengdonglin wrote:
> >>> On 2023/9/13 21:34, Alan Maguire wrote:
> >>>> On 13/09/2023 11:32, pengdonglin wrote:
> >>>>> On 2023/9/13 2:46, Alexei Starovoitov wrote:
> >>>>>> On Tue, Sep 12, 2023 at 10:03 AM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On Tue, 2023-09-12 at 09:40 -0700, Alexei Starovoitov wrote:
> >>>>>>>> On Tue, Sep 12, 2023 at 7:19 AM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, 2023-09-12 at 16:51 +0300, Eduard Zingerman wrote:
> >>>>>>>>>> On Sat, 2023-09-09 at 02:16 -0700, Donglin Peng wrote:
> >>>>>>>>>>> Currently, we are only using the linear search method to find the
> >>>>>>>>>>> type id by the name, which has a time complexity of O(n). This
> >>>>>>>>>>> change involves sorting the names of btf types in ascending order
> >>>>>>>>>>> and using binary search, which has a time complexity of O(log(n)).
> >>>>>>>>>>> This idea was inspired by the following patch:
> >>>>>>>>>>>
> >>>>>>>>>>> 60443c88f3a8 ("kallsyms: Improve the performance of kallsyms_lookup_name()").
> >>>>>>>>>>>
> >>>>>>>>>>> At present, this improvement applies only to searching in vmlinux's
> >>>>>>>>>>> and modules' BTFs, and the kind should only be BTF_KIND_FUNC or
> >>>>>>>>>>> BTF_KIND_STRUCT.
> >>>>>>>>>>>
> >>>>>>>>>>> Another change is the search direction: we search the BTF first and
> >>>>>>>>>>> then its base, and the type id of the first matching btf_type is
> >>>>>>>>>>> returned.
> >>>>>>>>>>>
> >>>>>>>>>>> Here are timing results for finding the type ids of all 67,819
> >>>>>>>>>>> kernel functions in vmlinux's BTF by name:
> >>>>>>>>>>>
> >>>>>>>>>>> Before: 17000 ms
> >>>>>>>>>>> After: 10 ms
> >>>>>>>>>>>
> >>>>>>>>>>> The average lookup performance improved about 1700x in the above
> >>>>>>>>>>> scenario.
> >>>>>>>>>>>
> >>>>>>>>>>> However, this change consumes more memory; for example, the index
> >>>>>>>>>>> for 67,819 kernel functions takes about 530KB.
> >>>>>>>>>>
> >>>>>>>>>> Hi Donglin,
> >>>>>>>>>>
> >>>>>>>>>> I think this is a good improvement. However, I wonder: why did you
> >>>>>>>>>> choose to have a separate name map for each BTF kind?
> >>>>>>>>>>
> >>>>>>>>>> I did some analysis for my local testing kernel config and got the
> >>>>>>>>>> following numbers:
> >>>>>>>>>> - total number of BTF objects: 97350
> >>>>>>>>>> - number of FUNC and STRUCT objects: 51597
> >>>>>>>>>> - number of FUNC, STRUCT, UNION, ENUM, ENUM64, TYPEDEF, DATASEC objects: 56817
> >>>>>>>>>>   (these are all the kinds for which lookup by name might make sense)
> >>>>>>>>>> - number of named objects: 54246
> >>>>>>>>>> - name collisions:
> >>>>>>>>>>   - unique names: 53985
> >>>>>>>>>>   - 2 objects with the same name: 129
> >>>>>>>>>>   - 3 objects with the same name: 3
> >>>>>>>>>>
> >>>>>>>>>> So it appears that having a single map for all named objects makes
> >>>>>>>>>> sense and would also simplify the implementation; what do you think?
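> >>>>>>>>>>
> >>>>>>>>>> Roughly, a lookup over such a single sorted map could look like the
> >>>>>>>>>> sketch below (untested; sorted_ids[] and nr_sorted are made-up
> >>>>>>>>>> fields holding type ids sorted by name; btf_type_by_id() and
> >>>>>>>>>> btf_name_by_offset() are the existing helpers):
> >>>>>>>>>>
> >>>>>>>>>> static s32 btf_find_by_name_sorted(const struct btf *btf,
> >>>>>>>>>> 				   const char *name)
> >>>>>>>>>> {
> >>>>>>>>>> 	u32 lo = 0, hi = btf->nr_sorted;
> >>>>>>>>>>
> >>>>>>>>>> 	/* find the lowest index whose name is >= the target name */
> >>>>>>>>>> 	while (lo < hi) {
> >>>>>>>>>> 		u32 mid = lo + (hi - lo) / 2;
> >>>>>>>>>> 		const struct btf_type *t = btf_type_by_id(btf, btf->sorted_ids[mid]);
> >>>>>>>>>>
> >>>>>>>>>> 		if (strcmp(btf_name_by_offset(btf, t->name_off), name) < 0)
> >>>>>>>>>> 			lo = mid + 1;
> >>>>>>>>>> 		else
> >>>>>>>>>> 			hi = mid;
> >>>>>>>>>> 	}
> >>>>>>>>>> 	if (lo < btf->nr_sorted) {
> >>>>>>>>>> 		const struct btf_type *t = btf_type_by_id(btf, btf->sorted_ids[lo]);
> >>>>>>>>>>
> >>>>>>>>>> 		/* exact match at the lowest candidate index */
> >>>>>>>>>> 		if (!strcmp(btf_name_by_offset(btf, t->name_off), name))
> >>>>>>>>>> 			return btf->sorted_ids[lo];
> >>>>>>>>>> 	}
> >>>>>>>>>> 	return -ENOENT;
> >>>>>>>>>> }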
> >>>>>>>>>
> >>>>>>>>> Some more numbers for my config:
> >>>>>>>>> - 13241 types (struct, union, typedef, enum), log2 13241 = 13.7
> >>>>>>>>> - 43575 funcs, log2 43575 = 15.4
> >>>>>>>>> Thus, having a separate map for types vs. functions might save ~1.7
> >>>>>>>>> search iterations. Is that a significant slowdown in practice?
> >>>>>>>>
> >>>>>>>> What do you propose to do in case of duplicates?
> >>>>>>>> func and struct can have the same name, but they will have two
> >>>>>>>> different btf_ids. How do we store them?
> >>>>>>>> Also, we might add global vars to BTF. Such a request has come up
> >>>>>>>> several times. So we need to make sure our search approach scales to
> >>>>>>>> func, struct, and vars. I don't recall whether we search any other
> >>>>>>>> kinds. Separate arrays for different kinds seem ok.
> >>>>>>>> It's a bit of code complexity, but it's not an increase in memory.
> >>>>>>>
> >>>>>>> Binary search gives, say, the lowest index of an object with name A;
> >>>>>>> then increment the index while the name remains A, looking for the
> >>>>>>> correct kind. Given the name collision numbers above, 99% of the time
> >>>>>>> there would be no need to iterate, and in the few remaining cases a
> >>>>>>> couple of iterations would suffice.
> >>>>>>>
> >>>>>>> The same logic would be necessary with the current approach if
> >>>>>>> different BTF kinds were allowed in the BTF_ID_NAME_* cohorts. I
> >>>>>>> figured these cohorts are mainly a way to split the tree for faster
> >>>>>>> lookups, but maybe that is not the main intent.
> >>>>>>>
> >>>>>>>> With 13k structs and 43k funcs it's 56k * (4 + 4) bytes, i.e. about
> >>>>>>>> 0.5 Mbyte of extra memory. That's quite a bit. Anything we can do to
> >>>>>>>> compress it?
> >>>>>>>
> >>>>>>> That's an interesting question; off the top of my head: pre-sort in
> >>>>>>> pahole (re-assign IDs so that an increasing ID also means an
> >>>>>>> "increasing" name), shouldn't be that difficult.
> >>>>>>
> >>>>>> That sounds great. kallsyms are pre-sorted at build time.
> >>>>>> We should do the same with BTF.
> >>>>>> I think GCC can emit BTF directly now, and LLVM emits it for bpf progs
> >>>>>> too, but since vmlinux and kernel module BTFs will keep being processed
> >>>>>> through pahole, we don't have to make gcc/llvm sort things right away.
> >>>>>> pahole will be enough. The kernel can do an "is it sorted" check during
> >>>>>> BTF validation and then use binary search, falling back to linear
> >>>>>> search when the BTF is not sorted (i.e. produced by an old pahole).
> >>>>>>
> >>>>>
> >>>>> Yeah, I agree and will attempt to modify pahole and run a test. Do we
> >>>>> need to introduce a new macro to control the behavior when the BTF is
> >>>>> not sorted? If it is not sorted, we could use the method from this
> >>>>> patch or fall back to linear search.
> >>>>>
> >>>>
> >>>> One challenge with pahole is that it often runs in parallel mode, so I
> >>>> suspect any sorting would have to be done after merging across threads.
> >>>> Perhaps BTF deduplication would be a good time to re-sort by name? BTF
> >>>> dedup happens after BTF has been merged, and a new "sorted"
> >>>> btf_dedup_opts option could be added and controlled by a pahole option.
> >>>> However, dedup is pretty complicated already...
> >>>>
> >>>> One thing we should weigh up, though, is whether there are benefits to
> >>>> the way BTF is currently laid out.
> >>>> It tends to start with base types, and often-encountered types end up
> >>>> being located towards the start of the BTF data. For example:
> >>>>
> >>>> [1] INT 'long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
> >>>> [2] CONST '(anon)' type_id=1
> >>>> [3] VOLATILE '(anon)' type_id=1
> >>>> [4] ARRAY '(anon)' type_id=1 index_type_id=21 nr_elems=2
> >>>> [5] PTR '(anon)' type_id=8
> >>>> [6] CONST '(anon)' type_id=5
> >>>> [7] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
> >>>> [8] CONST '(anon)' type_id=7
> >>>> [9] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)
> >>>> [10] CONST '(anon)' type_id=9
> >>>> [11] TYPEDEF '__s8' type_id=12
> >>>> [12] INT 'signed char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
> >>>> [13] TYPEDEF '__u8' type_id=14
> >>>>
> >>>> So often-used types will be found quickly, even under linear search
> >>>> conditions.
> >>>
> >>> I found that there seems to be no code in the kernel that gets the ID of
> >>> a basic data type by calling btf_find_by_name_kind directly. The typical
> >>> use of this function is to obtain the ID of a structure or function, and
> >>> once we have that ID, getting the IDs of its members or parameters is O(1).
> >>>
> >>> ./kernel/trace/trace_probe.c:383: id = btf_find_by_name_kind(btf, funcname, BTF_KIND_FUNC);
> >>> ./kernel/bpf/btf.c:3523: id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
> >>> ./kernel/bpf/btf.c:5504: id = btf_find_by_name_kind(btf, alloc_obj_fields[i], BTF_KIND_STRUCT);
> >>> ./kernel/bpf/bpf_struct_ops.c:128: module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> >>> ./net/ipv4/bpf_tcp_ca.c:28: type_id = btf_find_by_name_kind(btf, "sock", BTF_KIND_STRUCT);
> >>> ./net/ipv4/bpf_tcp_ca.c:33: type_id = btf_find_by_name_kind(btf, "tcp_sock", BTF_KIND_STRUCT);
> >>> ./net/netfilter/nf_bpf_link.c:181: type_id = btf_find_by_name_kind(btf, name, BTF_KIND_STRUCT);
> >>>
> >>>>
> >>>> When we look at how many lookups are by id (which are O(1), since they
> >>>> are done via the btf->types[] array) versus by name, we see:
> >>>>
> >>>> $ grep btf_type_by_id kernel/bpf/*.c | wc -l
> >>>> 120
> >>>> $ grep btf_find_by_nam kernel/bpf/*.c | wc -l
> >>>> 15
> >>>>
> >>>> I don't see a huge number of name-based lookups, and I think most are
> >>>> outside of the hotter codepaths, unless I'm missing some. All of which
> >>>> is to say it would be a good idea to have a clear sense of what will get
> >>>> faster with sorted-by-name BTF. Thanks!
> >>>
> >>> The story goes like this.
> >>>
> >>> I have added a new feature to the function graph tracer called
> >>> "funcgraph_retval"; here is the link:
> >>>
> >>> https://lore.kernel.org/all/1fc502712c981e0e6742185ba242992170ac9da8.1680954589.git.pengdonglin@xxxxxxxxxxxxxx/
> >>>
> >>> With this feature we can obtain the return value of almost every
> >>> function during kernel execution, which helps us analyze problems.
> >>>
> >>
> >> It's a great feature!
> >
> > Thanks.
> >
> >>
> >>> However, this feature has two main drawbacks.
> >>>
> >>> 1. Even if a function's return type is void, a return value will still
> >>> be printed.
> >>>
> >>> 2. The return value printed may be incorrect when the width of the
> >>> return type is smaller than the generic register (see the sketch below).
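> >>>
> >>> A return type narrower than a register only defines the low bits of the
> >>> return register, so the upper bits may hold stale data. A minimal sketch
> >>> of the masking needed (trim_retval() is a made-up helper; it assumes the
> >>> return type size in bytes was already resolved from BTF):
> >>>
> >>> /* keep only the low 'size' bytes of the raw return register value */
> >>> static u64 trim_retval(u64 regval, u32 size)
> >>> {
> >>> 	if (size >= sizeof(u64))
> >>> 		return regval;
> >>> 	return regval & ((1ULL << (size * 8)) - 1);
> >>> }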
> >>>
> >>> I think if we can get the return type of the function, then the
> >>> drawbacks mentioned above can be eliminated. The function
> >>> btf_find_by_name_kind can be used to get the ID of the kernel function;
> >>> then we can get its return type easily. If the return type is void, the
> >>> recorded return value will not be printed. If the width of the return
> >>> type is smaller than the generic register, the value stored in the upper
> >>> bits will be trimmed. I have written a demo, and these drawbacks were
> >>> resolved.
> >>>
> >>> However, during my test, I found that it took too much time to read the
> >>> trace log with this feature enabled, because the trace log consists of
> >>> 200,000 lines. The majority of the time was consumed by
> >>> btf_find_by_name_kind, which is called 200,000 times.
> >>>
> >>> So I think the performance of btf_find_by_name_kind may need to be
> >>> improved.
> >>>
> >>
> >> If I recall, Masami's work uses BTF ids, but can cache them since the
> >> user explicitly asks for specific fields in the trace output. I'm
> >> presuming that's not an option for you because funcgraph tracing enables
> >> everything (or at least everything under a filter predicate) and you
> >> have limited context to work with, is that right?
> >
> > Yes, right.
> >
> >>
> >> Looking at print_graph_entry_leaf(), which I _think_ is where you'd need
> >> to print the retval from, you have access to the function address via
> >> call->func, and I presume you get the name from snprinting the symbol to
> >> a string or similar. So you're stuck in a context where you have the
> >> function address, and from that you can derive the function name. Is
> >> that correct? Thanks!
> >
> > Yes, both print_graph_return and print_graph_entry_leaf call
> > print_graph_retval to print the return value. It calls
> > sprint_symbol_no_offset with call->func to get the function name, and
> > then btf_find_by_name_kind to get the return type.
> >
>
> So what you ultimately need is a way to map from the information
> available to the size of the return value associated with a function.
>
> On the BPF side we've been thinking a bit about the relationship between
> kernel function addresses and their BTF representations; sometimes
> knowing the BTF->address mapping is needed for cases where the same
> function name has multiple inconsistent function signatures in BTF. We
> don't represent function addresses yet in BTF, but may end up having to.
> The reason I mention this is that, in an ideal world, it would help to
> populate kallsyms entries with their associated BTF ids;

I think it would be cleaner to keep addresses in BTF, since we might have
the same btf_id func with multiple addresses, and funcs with the same
name but different btf_ids, each with multiple addresses too. I suspect a
one-to-one kallsyms-to-btf_id mapping won't cover all cases.

Also, we search BTFs not only for vars/funcs, but for types too. It's
better to optimize btf_find_by_name_kind() so that it's fast for finding
structs by name as well.

imo Eduard's proposal to sort all BTFs by name after dedup is the best.
I don't think sorting will add a noticeable build time increase.
We don't need to add pahole flags either.
The kernel can handle sorted vs non-sorted BTFs transparently.
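Something along these lines (untested sketch; btf->name_sorted is a
made-up flag, and btf_find_by_name_kind_sorted()/_linear() stand for the
binary search discussed earlier in the thread and the existing linear
walk):

/* one O(n) pass during BTF validation; for simplicity, assumes a base
 * BTF where type ids start at 1.  Anonymous types have an empty name
 * and would sort first.
 */
static bool btf_check_name_sorted(const struct btf *btf)
{
	const char *prev = "";
	u32 i;

	for (i = 1; i <= btf->nr_types; i++) {
		const struct btf_type *t = btf_type_by_id(btf, i);
		const char *name = btf_name_by_offset(btf, t->name_off);

		if (strcmp(prev, name) > 0)
			return false;
		prev = name;
	}
	return true;
}

s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
{
	if (btf->name_sorted)
		return btf_find_by_name_kind_sorted(btf, name, kind);
	/* old pahole: fall back to the existing linear walk */
	return btf_find_by_name_kind_linear(btf, name, kind);
}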