On Tue, Aug 9, 2022 at 1:25 AM houtao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 8/3/2022 6:38 AM, Andrii Nakryiko wrote:
> > On Tue, Jul 26, 2022 at 5:42 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
> >> Hi,
> >>
> > Hey, sorry I'm catching up on upstream and there are just too many
> > complicated patch sets coming in, so it takes time to get through them.
> >
> > I think you did a great job with this implementation, it's certainly
> > worth submitting as non-RFC for proper upstream review. I know that
> > some people didn't get your patches, they got into spam somehow. So I
> > think it would be great to just resubmit it as non-RFC so that it
> > appears in patchwork as to-be-reviewed patches and hopefully gets
> > a wider audience to review it.
> Thanks for your encouragement.
> The spam problem is due to some misconfiguration in the email servers of
> our company, and it can not be fixed soon, so I have switched to another
> email account and hope that will fix the spam problem.
>
> > I've tried to answer some questions below, but would definitely like
> > more people to chime in. I haven't gone through the implementation in
> > detail, but superficially it looks pretty clean and certainly ready
> > for proper non-RFC review.
> >
> > One point about the user API would be to maybe use bpf_dynptr as the
> > interface for specifying a variable-sized lookup key instead of the
> > hard-coded bpf_qp_trie_key. Please check the recent work by Joanne on
> > bpf_dynptr.
> Will check how to achieve that. Does that mean we will have a real
> variable-length key instead of a fixed-allocated-length key with a
> variable used length?

Yes, check the bpf_dynptr patches. We are still working on expanding
bpf_dynptr use cases and APIs, but all the pieces needed for
variable-sized keys are already in place.

> > In short: looks great, I think it's certainly worth adding this as a
> > BPF map type. Please submit as non-RFC and go through a proper review
> > process.
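For readers following the discussion: a "fixed-allocated-length key with a
variable used length", as contrasted with a real variable-length key above,
could look roughly like the sketch below. The struct layout, field names,
and comparison helper are illustrative assumptions, not the definitions
from the patch:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the map's key_size fixes the allocation (here 256
 * bytes total), while len records how many bytes of data[] are actually
 * meaningful for this particular key. */
struct qp_trie_key {
	uint32_t len;        /* used length, at most sizeof(data) */
	uint8_t data[252];   /* fixed allocation; tail beyond len is zeroed */
};

/* Lexicographic compare that only examines the used bytes, the way a
 * trie lookup would, instead of always scanning the full fixed buffer. */
static int qp_trie_key_cmp(const struct qp_trie_key *a,
			   const struct qp_trie_key *b)
{
	uint32_t min = a->len < b->len ? a->len : b->len;
	int c = memcmp(a->data, b->data, min);

	if (c)
		return c;
	if (a->len == b->len)
		return 0;
	return a->len < b->len ? -1 : 1;  /* shorter key is a prefix, sorts first */
}
```

With bpf_dynptr instead, the user would hand the kernel a (pointer, length)
pair directly and the fixed 252-byte allocation would no longer be part of
the UAPI.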
> > Looking forward (even if that means reviewing 1000s of lines of
> > dense algorithmic code :) ).
> Thanks. :)
> >> The initial motivation for the qp-trie map is to reduce memory usage for
> >> string keys, especially those with large differences in length, as
> >> discussed in [0]. And as a big-endian lexicographically ordered map, it
> >> can also be used for any binary data with fixed or variable length.
> >>
> >> Now the basic functionality of qp-trie is ready, so I am posting an RFC
> >> version to get more feedback or suggestions about qp-trie. Especially
> >> feedback about the following questions:
> >>
> >> (1) Application scenario for qp-trie
> >> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> >> advantage would be a speed-up of lookup operations due to the lower tree
> >> depth of qp-trie. Maybe the performance of update could also be
> >> improved, although in Cilium there is a big lock around lpm-trie
> >> updates [1]. Are
> > Well, using the qp-trie approach as an internal implementation of
> > lpm-trie is probably a major win already, what's wrong with that?
> I am OK with that. My concern is that you had mentioned that qp-trie could
> be used to remove the global lock in lpm-trie, but now there are still
> sub-tree locks in qp-trie, so I am not sure whether the scalability of
> qp-trie is still a problem. I searched for users of lpm-trie on GitHub and
> only found Cilium. In its source code [0], the update and deletion
> procedures of lpm-trie are protected by a big lock, so maybe the global
> lock is not a big issue right now.

I think typically users just stay away from lpm-trie due to its
slowness. Once this is addressed, we'll have more use cases, as the
semantics of this BPF map are useful in a bunch of practical use cases.
So yes, it's very much an issue, overall.

> [0]:
> https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
> >> there any other use cases for qp-trie?
> >> Especially those cases which need
> >> both ordering and memory efficiency, or cases in which jhash() of htab
> >> creates too many collisions and qp-trie lookup performs better than
> >> hash-table lookup, as shown below:
> > I'm thinking about qp-trie as a dynamically growable lookup table.
> > That's a pretty big use case already. There is an RB tree proposal
> > under review right now which would also satisfy the dynamically
> > growable criteria, but its focus is slightly different and it remains
> > to be seen how convenient it will be as a general-purpose resizable
> > alternative to hashmap. But from benchmarks I've found online, RB tree
> > will definitely use more memory than qp-trie.
> qp-trie is also RCU-read safe, which is hard for an rb-tree to achieve. I
> also hope qp-trie will have better lookup performance than rb-tree.
> > The ordered property also seems very useful, but I don't yet have a
> > specific use case for that. But once we have a data structure like
> > this in BPF, I'm sure use cases will pop up.
> For an ordered map, next() and prev() helpers would be useful, but now
> there is no such support. An embedded rb_node in a program-defined
> structure may make it easier to implement such helpers. Maybe we can also
> add such helpers for qp-trie as a next step?

See the bpf_for_each_map_elem() helper, we already have the ability to
iterate BPF map elements in *some* order, but in this case for qp-trie
those elements will be guaranteed to be in sorted order.
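To illustrate why sorted iteration comes for free in a qp-trie: children of
a branch are stored in increasing key order, so a plain depth-first walk
emits keys lexicographically, with no extra sorting step. A toy userspace
model (all types and names are mine for illustration, not the patch's):

```c
#include <string.h>

/* Minimal stand-in for a trie node: a leaf carries a key; a branch
 * carries children that are already ordered by their key prefix. */
struct toy_node {
	const char *key;             /* non-NULL => leaf */
	struct toy_node *child[4];   /* children kept in ascending order */
	int nr;                      /* number of children in a branch */
};

/* Depth-first walk in ascending slot order; appends each visited key
 * (space-separated) to out, demonstrating sorted iteration. */
static void walk(const struct toy_node *n, char *out)
{
	if (n->key) {
		strcat(out, n->key);
		strcat(out, " ");
		return;
	}
	for (int i = 0; i < n->nr; i++)   /* ascending order => sorted output */
		walk(n->child[i], out);
}
```

A next()/prev() helper for an ordered map would amount to the same walk,
suspended and resumed at a given key.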
> >> Randomly-generated binary data with variable length (length range=[1, 256] entries=16K)
> >>
> >> htab lookup (1 thread) 5.062 ± 0.004M/s (drops 0.002 ± 0.000M/s mem 8.125 MiB)
> >> htab lookup (2 thread) 10.256 ± 0.017M/s (drops 0.006 ± 0.000M/s mem 8.114 MiB)
> >> htab lookup (4 thread) 20.383 ± 0.006M/s (drops 0.009 ± 0.000M/s mem 8.117 MiB)
> >> htab lookup (8 thread) 40.727 ± 0.093M/s (drops 0.010 ± 0.000M/s mem 8.123 MiB)
> >> htab lookup (16 thread) 81.333 ± 0.311M/s (drops 0.020 ± 0.000M/s mem 8.122 MiB)
> >>
> >> qp-trie lookup (1 thread) 10.161 ± 0.008M/s (drops 0.006 ± 0.000M/s mem 4.847 MiB)
> >> qp-trie lookup (2 thread) 20.287 ± 0.024M/s (drops 0.007 ± 0.000M/s mem 4.828 MiB)
> >> qp-trie lookup (4 thread) 40.784 ± 0.020M/s (drops 0.015 ± 0.000M/s mem 4.071 MiB)
> >> qp-trie lookup (8 thread) 81.165 ± 0.013M/s (drops 0.040 ± 0.000M/s mem 4.045 MiB)
> >> qp-trie lookup (16 thread) 159.955 ± 0.014M/s (drops 0.108 ± 0.000M/s mem 4.495 MiB)
> >>
> > Kind of surprised that qp-trie is twice as fast as hashmap. Do you
> > have any idea why hashmap is slower? Was htab pre-allocated? What was
> > its max_entries?
> The hash table is not pre-allocated, and pre-allocation doesn't help
> lookup performance in this benchmark. max_entries is 16K, and if
> max_entries is set to 4K or 128K, the performance win of qp-trie is almost
> the same (worse when the data set is bigger).
>
> The better performance is due to two reasons:
> (1) The height of qp-trie is low due to the randomly-generated data.
> The top-most branch nodes in qp-trie are almost full and have 16 or 17
> child nodes. If using /proc/kallsyms as the input data set, the number of
> child nodes of the top-most branch nodes is just about 2~4.

right, makes sense, dense data with a high branching factor is the best
scenario for trie-based data structures

> (2) There is a large difference between the used lengths of keys.
> The max key size is 256, but the range of valid data lengths is [1, 255]
> and the unused part is zeroed.
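The "16 or 17 child nodes" figure reflects the classic qp-trie branching
scheme: each branch consumes one 4-bit nibble of the key (16 possible
values, plus one extra slot for keys that end at this branch), keeps a
bitmap of which children exist, and locates a child in a dense array via
popcount. A minimal sketch of those two core operations, with my own
function names rather than the patch's:

```c
#include <stdint.h>

/* Nibble i of the key, high nibble first, so that comparing nibbles in
 * order matches big-endian lexicographic order of the key bytes. */
static unsigned key_nibble(const uint8_t *key, unsigned i)
{
	uint8_t byte = key[i / 2];

	return (i & 1) ? (byte & 0x0f) : (byte >> 4);
}

/* Dense slot of the child for nibble `nib`: count how many children
 * with smaller nibble values are present. Returns -1 if absent. The
 * branch stores only present children, so no space is wasted on the
 * (up to 17) empty slots. */
static int child_slot(uint32_t bitmap, unsigned nib)
{
	uint32_t bit = 1u << nib;

	if (!(bitmap & bit))
		return -1;
	return __builtin_popcount(bitmap & (bit - 1));
}
```

With random keys, every top-level nibble value tends to be populated, so
the bitmap is nearly full and the tree is wide and shallow; with kallsyms
strings only a few nibble values occur, giving the 2~4 children observed.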
> For htab-lookup, it has to compare 256 bytes each time during the list
> traversal, but qp-trie only needs to compare the used length of the key.

yep, makes sense as well

> >> * non-zero drops are due to duplicated keys among the generated keys.
> >>
> >> (2) More fine-grained locks in qp-trie
> >> Now qp-trie is divided into 256 sub-trees by using the first character of
> > character -> byte, it's not always strings, right?
> Yes, qp-trie supports an arbitrary byte array as the key.
> >> the key, and each sub-tree is protected by one spinlock. From the data
> >> below, although the update/delete speed of qp-trie is slower compared
> >> with the hash table, it scales similarly to the hash table. So maybe
> >> 256 locks are a good enough solution?
> >>
> >> Strings in /proc/kallsyms
> >> htab update (1 thread) 2.850 ± 0.129M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
> >> htab update (2 thread) 4.363 ± 0.031M/s (drops 0.000 ± 0.000M/s mem 33.563 MiB)
> >> htab update (4 thread) 6.306 ± 0.096M/s (drops 0.000 ± 0.000M/s mem 33.718 MiB)
> >> htab update (8 thread) 6.611 ± 0.026M/s (drops 0.000 ± 0.000M/s mem 33.627 MiB)
> >> htab update (16 thread) 6.390 ± 0.015M/s (drops 0.000 ± 0.000M/s mem 33.564 MiB)
> >> qp-trie update (1 thread) 1.157 ± 0.099M/s (drops 0.000 ± 0.000M/s mem 18.333 MiB)
> >> qp-trie update (2 thread) 1.920 ± 0.062M/s (drops 0.000 ± 0.000M/s mem 18.293 MiB)
> >> qp-trie update (4 thread) 2.630 ± 0.050M/s (drops 0.000 ± 0.000M/s mem 18.472 MiB)
> >> qp-trie update (8 thread) 3.171 ± 0.027M/s (drops 0.000 ± 0.000M/s mem 18.301 MiB)
> >> qp-trie update (16 thread) 3.782 ± 0.036M/s (drops 0.000 ± 0.000M/s mem 19.040 MiB)
> > qp-trie being slower than htab matches my expectation and what I
> > observed when trying to use qp-trie with libbpf. But I think for a lot
> > of cases the memory vs CPU tradeoff, coupled with the ability to
> > dynamically grow qp-trie, will make this data structure worthwhile.
> > > > > >> (3) Improve memory efficiency further > >> When using strings in BTF string section as a data set for qp-trie, the > >> slab memory usage showed in cgroup memory.stats file is about 11MB for > >> qp-trie and 15MB for hash table as shown below. However the theoretical > >> memory usage for qp-trie is ~6.8MB (is ~4.9MB if removing "parent" & "rcu" > >> fields from qp_trie_branch) and the extra memory usage (about 38% of total > >> usage) mainly comes from internal fragment in slab (namely 2^n alignment > >> for allocation) and overhead in kmem-cgroup accounting. We can reduce the > >> internal fragment by creating separated kmem_cache for qp_trie_branch with > >> different child nodes, but not sure whether it is worthy or not. > >> > > Please CC Paul McKenney (paulmck@xxxxxxxxxx) for your non-RFC patch > > set and maybe he has some good idea how to avoid having rcu_head in > > each leaf node. Maybe some sort of per-CPU queue of to-be-rcu-freed > > elements, so that we don't have to keep 16 bytes in each leaf and > > branch node? > Will cc Paul. I found a head-less kfree_rcu() which only needs a pointer, but > its downside is the calling context needs to sleepable, so it doesn't fit. And > it seems that BPF specific memory allocator from Alexei can also help to remove > rcu_head in bpf map element, right ? Hm.. don't know, maybe? > >> And in order to prevent allocating a rcu_head for each leaf node, now only > >> branch node is RCU-freed, so when replacing a leaf node, a new branch node > >> and a new leaf node will be allocated instead of replacing the old leaf > >> node and RCU-freed the old leaf node. Also not sure whether or not it is > >> worthy. 
> >>
> >> Strings in BTF string section (entries=115980):
> >> htab lookup (1 thread) 9.889 ± 0.006M/s (drops 0.000 ± 0.000M/s mem 15.069 MiB)
> >> qp-trie lookup (1 thread) 5.132 ± 0.002M/s (drops 0.000 ± 0.000M/s mem 10.721 MiB)
> >>
> >> All files under linux kernel source directory (entries=74359):
> >> htab lookup (1 thread) 8.418 ± 0.077M/s (drops 0.000 ± 0.000M/s mem 14.207 MiB)
> >> qp-trie lookup (1 thread) 4.966 ± 0.003M/s (drops 0.000 ± 0.000M/s mem 9.355 MiB)
> >>
> >> Domain names for Alexa top million web site (entries=1000000):
> >> htab lookup (1 thread) 4.551 ± 0.043M/s (drops 0.000 ± 0.000M/s mem 190.761 MiB)
> >> qp-trie lookup (1 thread) 2.804 ± 0.017M/s (drops 0.000 ± 0.000M/s mem 83.194 MiB)
> >>
> >> Comments and suggestions are always welcome.
> >>
> >> Regards,
> >> Tao
> >>
> >> [0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@xxxxxxxxxxxxxx/
> >> [1]: https://github.com/cilium/cilium/blob/5145e31cd65db3361f6538d5f5f899440b769070/pkg/datapath/prefilter/prefilter.go#L123
> >>
> >> Hou Tao (3):
> >>   bpf: Add support for qp-trie map
> >>   selftests/bpf: add a simple test for qp-trie
> >>   selftests/bpf: add benchmark for qp-trie map
> >>
> >
> >>  include/linux/bpf_types.h                     |    1 +
> >>  include/uapi/linux/bpf.h                      |    8 +
> >>  kernel/bpf/Makefile                           |    1 +
> >>  kernel/bpf/bpf_qp_trie.c                      | 1064 +++++++++++++++++
> >>  tools/include/uapi/linux/bpf.h                |    8 +
> >>  tools/testing/selftests/bpf/Makefile          |    5 +-
> >>  tools/testing/selftests/bpf/bench.c           |   10 +
> >>  .../selftests/bpf/benchs/bench_qp_trie.c      |  499 ++++++++
> >>  .../selftests/bpf/benchs/run_bench_qp_trie.sh |   55 +
> >>  .../selftests/bpf/prog_tests/str_key.c        |   69 ++
> >>  .../selftests/bpf/progs/qp_trie_bench.c       |  218 ++++
> >>  tools/testing/selftests/bpf/progs/str_key.c   |   85 ++
> >>  12 files changed, 2022 insertions(+), 1 deletion(-)
> >>  create mode 100644 kernel/bpf/bpf_qp_trie.c
> >>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
> >>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
> >>  create mode 100644 tools/testing/selftests/bpf/prog_tests/str_key.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
> >>  create mode 100644 tools/testing/selftests/bpf/progs/str_key.c
> >>
> >> --
> >> 2.29.2
> >>
> >
> .
>