On Thu, Sep 14, 2023 at 10:14 AM Alan Maguire <alan.maguire@xxxxxxxxxx> wrote:
>
> On 14/09/2023 14:05, pengdonglin wrote:
> > On 2023/9/14 20:46, Alan Maguire wrote:
> >> On 14/09/2023 11:13, pengdonglin wrote:
> >>> On 2023/9/13 21:34, Alan Maguire wrote:
> >>>> On 13/09/2023 11:32, pengdonglin wrote:
> >>>>> On 2023/9/13 2:46, Alexei Starovoitov wrote:
> >>>>>> On Tue, Sep 12, 2023 at 10:03 AM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> On Tue, 2023-09-12 at 09:40 -0700, Alexei Starovoitov wrote:
> >>>>>>>> On Tue, Sep 12, 2023 at 7:19 AM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, 2023-09-12 at 16:51 +0300, Eduard Zingerman wrote:
> >>>>>>>>>> On Sat, 2023-09-09 at 02:16 -0700, Donglin Peng wrote:
> >>>>>>>>>>> Currently, we are only using the linear search method to find the
> >>>>>>>>>>> type id by the name, which has a time complexity of O(n). This
> >>>>>>>>>>> change involves sorting the names of btf types in ascending order
> >>>>>>>>>>> and using binary search, which has a time complexity of O(log(n)).
> >>>>>>>>>>> This idea was inspired by the following patch:
> >>>>>>>>>>>
> >>>>>>>>>>> 60443c88f3a8 ("kallsyms: Improve the performance of kallsyms_lookup_name()").
> >>>>>>>>>>>
> >>>>>>>>>>> At present, this improvement applies only to searching in vmlinux's
> >>>>>>>>>>> and modules' BTFs, and the kind should only be BTF_KIND_FUNC or
> >>>>>>>>>>> BTF_KIND_STRUCT.
> >>>>>>>>>>>
> >>>>>>>>>>> Another change is the search direction: we search the BTF first and
> >>>>>>>>>>> then its base, and the type id of the first matching btf_type is
> >>>>>>>>>>> returned.
> >>>>>>>>>>>
> >>>>>>>>>>> Here are timing results for finding the type ids of all 67,819
> >>>>>>>>>>> kernel functions in vmlinux's BTF by name:
> >>>>>>>>>>>
> >>>>>>>>>>> Before: 17000 ms
> >>>>>>>>>>> After: 10 ms
> >>>>>>>>>>>
> >>>>>>>>>>> The average lookup performance improved about 1700x in the above
> >>>>>>>>>>> scenario.
> >>>>>>>>>>>
> >>>>>>>>>>> However, this change consumes more memory; for example, the index
> >>>>>>>>>>> for 67,819 kernel functions takes about 530KB.
> >>>>>>>>>>
> >>>>>>>>>> Hi Donglin,
> >>>>>>>>>>
> >>>>>>>>>> I think this is a good improvement. However, I wonder: why did you
> >>>>>>>>>> choose to have a separate name map for each BTF kind?
> >>>>>>>>>>
> >>>>>>>>>> I did some analysis for my local testing kernel config and got the
> >>>>>>>>>> following numbers:
> >>>>>>>>>> - total number of BTF objects: 97350
> >>>>>>>>>> - number of FUNC and STRUCT objects: 51597
> >>>>>>>>>> - number of FUNC, STRUCT, UNION, ENUM, ENUM64, TYPEDEF, DATASEC objects: 56817
> >>>>>>>>>>   (these are all the kinds for which lookup by name might make sense)
> >>>>>>>>>> - number of named objects: 54246
> >>>>>>>>>> - name collisions:
> >>>>>>>>>>   - unique names: 53985
> >>>>>>>>>>   - 2 objects with the same name: 129
> >>>>>>>>>>   - 3 objects with the same name: 3
> >>>>>>>>>>
> >>>>>>>>>> So it appears that having a single map for all named objects makes
> >>>>>>>>>> sense and would also simplify the implementation; what do you think?
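> >>>>>>>>>>
> >>>>>>>>>> Roughly, a lookup over such a single sorted map could look like the
> >>>>>>>>>> sketch below (untested; sorted_ids[] and nr_sorted are made-up
> >>>>>>>>>> fields holding type ids sorted by name; btf_type_by_id() and
> >>>>>>>>>> btf_name_by_offset() are the existing helpers):
> >>>>>>>>>>
> >>>>>>>>>> static s32 btf_find_by_name_sorted(const struct btf *btf,
> >>>>>>>>>> 				   const char *name)
> >>>>>>>>>> {
> >>>>>>>>>> 	u32 lo = 0, hi = btf->nr_sorted;
> >>>>>>>>>>
> >>>>>>>>>> 	/* find the lowest index whose name is >= the target name */
> >>>>>>>>>> 	while (lo < hi) {
> >>>>>>>>>> 		u32 mid = lo + (hi - lo) / 2;
> >>>>>>>>>> 		const struct btf_type *t = btf_type_by_id(btf, btf->sorted_ids[mid]);
> >>>>>>>>>>
> >>>>>>>>>> 		if (strcmp(btf_name_by_offset(btf, t->name_off), name) < 0)
> >>>>>>>>>> 			lo = mid + 1;
> >>>>>>>>>> 		else
> >>>>>>>>>> 			hi = mid;
> >>>>>>>>>> 	}
> >>>>>>>>>> 	if (lo < btf->nr_sorted) {
> >>>>>>>>>> 		const struct btf_type *t = btf_type_by_id(btf, btf->sorted_ids[lo]);
> >>>>>>>>>>
> >>>>>>>>>> 		/* exact match at the lowest candidate index */
> >>>>>>>>>> 		if (!strcmp(btf_name_by_offset(btf, t->name_off), name))
> >>>>>>>>>> 			return btf->sorted_ids[lo];
> >>>>>>>>>> 	}
> >>>>>>>>>> 	return -ENOENT;
> >>>>>>>>>> }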
> >>>>>>>>>
> >>>>>>>>> Some more numbers for my config:
> >>>>>>>>> - 13241 types (struct, union, typedef, enum), log2 13241 = 13.7
> >>>>>>>>> - 43575 funcs, log2 43575 = 15.4
> >>>>>>>>> Thus, having a separate map for types vs. functions might save ~1.7
> >>>>>>>>> search iterations. Is that a significant slowdown in practice?
> >>>>>>>>
> >>>>>>>> What do you propose to do in case of duplicates?
> >>>>>>>> func and struct can have the same name, but they will have two
> >>>>>>>> different btf_ids. How do we store them?
> >>>>>>>> Also, we might add global vars to BTF. Such a request has come up
> >>>>>>>> several times. So we need to make sure our search approach scales to
> >>>>>>>> func, struct, and vars. I don't recall whether we search any other
> >>>>>>>> kinds. Separate arrays for different kinds seem ok.
> >>>>>>>> It's a bit of code complexity, but it's not an increase in memory.
> >>>>>>>
> >>>>>>> Binary search gives, say, the lowest index of an object with name A;
> >>>>>>> then increment the index while the name remains A, looking for the
> >>>>>>> correct kind. Given the name collision numbers above, 99% of the time
> >>>>>>> there would be no need to iterate, and in the few remaining cases a
> >>>>>>> couple of iterations would suffice.
> >>>>>>>
> >>>>>>> The same logic would be necessary with the current approach if
> >>>>>>> different BTF kinds were allowed in the BTF_ID_NAME_* cohorts. I
> >>>>>>> figured these cohorts are mainly a way to split the tree for faster
> >>>>>>> lookups, but maybe that is not the main intent.
> >>>>>>>
> >>>>>>>> With 13k structs and 43k funcs it's 56k * (4 + 4) bytes, i.e. about
> >>>>>>>> 0.5 Mbyte of extra memory. That's quite a bit. Anything we can do to
> >>>>>>>> compress it?
> >>>>>>>
> >>>>>>> That's an interesting question; off the top of my head: pre-sort in
> >>>>>>> pahole (re-assign IDs so that an increasing ID also means an
> >>>>>>> "increasing" name), shouldn't be that difficult.
> >>>>>>
> >>>>>> That sounds great. kallsyms are pre-sorted at build time.
> >>>>>> We should do the same with BTF.
> >>>>>> I think GCC can emit BTF directly now, and LLVM emits it for bpf progs
> >>>>>> too, but since vmlinux and kernel module BTFs will keep being processed
> >>>>>> through pahole, we don't have to make gcc/llvm sort things right away.
> >>>>>> pahole will be enough. The kernel can do an "is it sorted" check during
> >>>>>> BTF validation and then use binary search, falling back to linear
> >>>>>> search when the BTF is not sorted (i.e. produced by an old pahole).
> >>>>>>
> >>>>>
> >>>>> Yeah, I agree and will attempt to modify pahole and run a test. Do we
> >>>>> need to introduce a new macro to control the behavior when the BTF is
> >>>>> not sorted? If it is not sorted, we could use the method from this
> >>>>> patch or fall back to linear search.
> >>>>>
> >>>>
> >>>> One challenge with pahole is that it often runs in parallel mode, so I
> >>>> suspect any sorting would have to be done after merging across threads.
> >>>> Perhaps BTF deduplication would be a good time to re-sort by name? BTF
> >>>> dedup happens after BTF has been merged, and a new "sorted"
> >>>> btf_dedup_opts option could be added and controlled by a pahole option.
> >>>> However, dedup is pretty complicated already...
> >>>>
> >>>> One thing we should weigh up, though, is whether there are benefits to
> >>>> the way BTF is currently laid out.
> >>>> It tends to start with base types, and often-encountered types end up
> >>>> being located towards the start of the BTF data. For example:
> >>>>
> >>>> [1] INT 'long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
> >>>> [2] CONST '(anon)' type_id=1
> >>>> [3] VOLATILE '(anon)' type_id=1
> >>>> [4] ARRAY '(anon)' type_id=1 index_type_id=21 nr_elems=2
> >>>> [5] PTR '(anon)' type_id=8
> >>>> [6] CONST '(anon)' type_id=5
> >>>> [7] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
> >>>> [8] CONST '(anon)' type_id=7
> >>>> [9] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)
> >>>> [10] CONST '(anon)' type_id=9
> >>>> [11] TYPEDEF '__s8' type_id=12
> >>>> [12] INT 'signed char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
> >>>> [13] TYPEDEF '__u8' type_id=14
> >>>>
> >>>> So often-used types will be found quickly, even under linear search
> >>>> conditions.
> >>>
> >>> I found that there seems to be no code in the kernel that gets the ID of
> >>> a basic data type by calling btf_find_by_name_kind directly. The typical
> >>> use of this function is to obtain the ID of a structure or function, and
> >>> once we have that ID, getting the IDs of its members or parameters is O(1).
> >>>
> >>> ./kernel/trace/trace_probe.c:383: id = btf_find_by_name_kind(btf, funcname, BTF_KIND_FUNC);
> >>> ./kernel/bpf/btf.c:3523: id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
> >>> ./kernel/bpf/btf.c:5504: id = btf_find_by_name_kind(btf, alloc_obj_fields[i], BTF_KIND_STRUCT);
> >>> ./kernel/bpf/bpf_struct_ops.c:128: module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
> >>> ./net/ipv4/bpf_tcp_ca.c:28: type_id = btf_find_by_name_kind(btf, "sock", BTF_KIND_STRUCT);
> >>> ./net/ipv4/bpf_tcp_ca.c:33: type_id = btf_find_by_name_kind(btf, "tcp_sock", BTF_KIND_STRUCT);
> >>> ./net/netfilter/nf_bpf_link.c:181: type_id = btf_find_by_name_kind(btf, name, BTF_KIND_STRUCT);
> >>>
> >>>>
> >>>> When we look at how many lookups are by id (which are O(1), since they
> >>>> are done via the btf->types[] array) versus by name, we see:
> >>>>
> >>>> $ grep btf_type_by_id kernel/bpf/*.c | wc -l
> >>>> 120
> >>>> $ grep btf_find_by_nam kernel/bpf/*.c | wc -l
> >>>> 15
> >>>>
> >>>> I don't see a huge number of name-based lookups, and I think most are
> >>>> outside of the hotter codepaths, unless I'm missing some. All of which
> >>>> is to say it would be a good idea to have a clear sense of what will get
> >>>> faster with sorted-by-name BTF. Thanks!
> >>>
> >>> The story goes like this.
> >>>
> >>> I have added a new feature to the function graph tracer called
> >>> "funcgraph_retval"; here is the link:
> >>>
> >>> https://lore.kernel.org/all/1fc502712c981e0e6742185ba242992170ac9da8.1680954589.git.pengdonglin@xxxxxxxxxxxxxx/
> >>>
> >>> With this feature we can obtain the return value of almost every
> >>> function during kernel execution, which helps us analyze problems.
> >>>
> >>
> >> It's a great feature!
> >
> > Thanks.
> >
> >>
> >>> However, this feature has two main drawbacks.
> >>>
> >>> 1. Even if a function's return type is void, a return value will still
> >>> be printed.
> >>>
> >>> 2. The return value printed may be incorrect when the width of the
> >>> return type is smaller than the generic register (see the sketch below).
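> >>>
> >>> A return type narrower than a register only defines the low bits of the
> >>> return register, so the upper bits may hold stale data. A minimal sketch
> >>> of the masking needed (trim_retval() is a made-up helper; it assumes the
> >>> return type size in bytes was already resolved from BTF):
> >>>
> >>> /* keep only the low 'size' bytes of the raw return register value */
> >>> static u64 trim_retval(u64 regval, u32 size)
> >>> {
> >>> 	if (size >= sizeof(u64))
> >>> 		return regval;
> >>> 	return regval & ((1ULL << (size * 8)) - 1);
> >>> }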
> >>>
> >>> I think if we can get the return type of the function, then the
> >>> drawbacks mentioned above can be eliminated. The function
> >>> btf_find_by_name_kind can be used to get the ID of the kernel function;
> >>> then we can get its return type easily. If the return type is void, the
> >>> recorded return value will not be printed. If the width of the return
> >>> type is smaller than the generic register, the value stored in the upper
> >>> bits will be trimmed. I have written a demo, and these drawbacks were
> >>> resolved.
> >>>
> >>> However, during my test, I found that it took too much time to read the
> >>> trace log with this feature enabled, because the trace log consists of
> >>> 200,000 lines. The majority of the time was consumed by
> >>> btf_find_by_name_kind, which is called 200,000 times.
> >>>
> >>> So I think the performance of btf_find_by_name_kind may need to be
> >>> improved.
> >>>
> >>
> >> If I recall, Masami's work uses BTF ids, but can cache them since the
> >> user explicitly asks for specific fields in the trace output. I'm
> >> presuming that's not an option for you because funcgraph tracing enables
> >> everything (or at least everything under a filter predicate) and you
> >> have limited context to work with, is that right?
> >
> > Yes, right.
> >
> >>
> >> Looking at print_graph_entry_leaf(), which I _think_ is where you'd need
> >> to print the retval from, you have access to the function address via
> >> call->func, and I presume you get the name from snprinting the symbol to
> >> a string or similar. So you're stuck in a context where you have the
> >> function address, and from that you can derive the function name. Is
> >> that correct? Thanks!
> >
> > Yes, both print_graph_return and print_graph_entry_leaf call
> > print_graph_retval to print the return value. It calls
> > sprint_symbol_no_offset with call->func to get the function name, and
> > then btf_find_by_name_kind to get the return type.
> >
>
> So what you ultimately need is a way to map from the information
> available to the size of the return value associated with a function.
>
> On the BPF side we've been thinking a bit about the relationship between
> kernel function addresses and their BTF representations; sometimes
> knowing the BTF->address mapping is needed for cases where the same
> function name has multiple inconsistent function signatures in BTF. We
> don't represent function addresses yet in BTF, but may end up having to.
> The reason I mention this is that, in an ideal world, it would help to
> populate kallsyms entries with their associated BTF ids;

I think it would be cleaner to keep addresses in BTF, since we might have
the same btf_id func with multiple addresses, and funcs with the same
name but different btf_ids, each with multiple addresses too. I suspect a
one-to-one kallsyms-to-btf_id mapping won't cover all cases.

Also, we search BTFs not only for vars/funcs, but for types too. It's
better to optimize btf_find_by_name_kind() so that it's fast for finding
structs by name as well.

imo Eduard's proposal to sort all BTFs by name after dedup is the best.
I don't think sorting will add a noticeable build time increase.
We don't need to add pahole flags either.
The kernel can handle sorted vs non-sorted BTFs transparently.
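Something along these lines (untested sketch; btf->name_sorted is a
made-up flag, and btf_find_by_name_kind_sorted()/_linear() stand for the
binary search discussed earlier in the thread and the existing linear
walk):

/* one O(n) pass during BTF validation; for simplicity, assumes a base
 * BTF where type ids start at 1.  Anonymous types have an empty name
 * and would sort first.
 */
static bool btf_check_name_sorted(const struct btf *btf)
{
	const char *prev = "";
	u32 i;

	for (i = 1; i <= btf->nr_types; i++) {
		const struct btf_type *t = btf_type_by_id(btf, i);
		const char *name = btf_name_by_offset(btf, t->name_off);

		if (strcmp(prev, name) > 0)
			return false;
		prev = name;
	}
	return true;
}

s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind)
{
	if (btf->name_sorted)
		return btf_find_by_name_kind_sorted(btf, name, kind);
	/* old pahole: fall back to the existing linear walk */
	return btf_find_by_name_kind_linear(btf, name, kind);
}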