From: Hou Tao <houtao1@xxxxxxxxxx>

Hi,

The initial motivation for the qp-trie map is to reduce memory usage for
string keys, especially those with large differences in length, as
discussed in [0]. And as a big-endian lexicographically ordered map, it
can also be used for any binary data with fixed or variable length.

Now the basic functionality of qp-trie is ready, so I am posting it to
get more feedback and suggestions about qp-trie, especially feedback on
the following questions:

(1) Use cases for qp-trie

Andrii had proposed to re-implement lpm-trie by using qp-trie. The
advantage would be a speed-up of lookup operations due to the lower tree
depth of qp-trie, and update performance may also improve. But are there
any other use cases for qp-trie? Especially cases which need both
ordering and memory efficiency, or cases in which qp-trie will have a
high fan-out and its lookup performance will be much better than a hash
table, as shown below:

Randomly-generated binary data (key size=255, max entries=16K, key length range:[1, 255])
htab lookup      (1  thread)    4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
htab lookup      (2  thread)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
htab lookup      (4  thread)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
htab lookup      (8  thread)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
htab lookup      (16 thread)   79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
qp-trie lookup   (1  thread)   10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
qp-trie lookup   (2  thread)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
qp-trie lookup   (4  thread)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
qp-trie lookup   (8  thread)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
qp-trie lookup   (16 thread)  164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)

* non-zero drops are due to duplicated keys among the generated keys.

(2) Improve update/delete performance for qp-trie

Now the top-5 overheads in update/delete operations are:

    21.23%  bench  [kernel.vmlinux]  [k] qp_trie_update_elem
    13.98%  bench  [kernel.vmlinux]  [k] qp_trie_delete_elem
     7.96%  bench  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
     5.16%  bench  [kernel.vmlinux]  [k] memcpy_erms
     5.00%  bench  [kernel.vmlinux]  [k] __kmalloc_node

The top-2 overheads are due to memory access and atomic ops on
max_entries. I had tried memory prefetch, but it didn't work out; maybe
I did it wrong. For the subtree spinlock overhead, I had also tried
hierarchical locking by using a hand-over-hand lock scheme, but it
didn't scale well [1]. I will try to increase the number of subtrees
from 256 to 1024, 4096 or bigger and check whether it makes any
difference.

For the atomic ops and kmalloc overhead, I think I can reuse the idea
from the patchset "bpf: BPF specific memory allocator". I have given
bpf_mem_alloc a simple try and encountered some problems. One problem is
the immediate reuse of freed objects in the BPF memory allocator.
Because qp-trie uses the BPF memory allocator to allocate and free
qp_trie_branch nodes, if a qp_trie_branch is reused immediately, the
lookup procedure may oops due to the incorrect content in the
qp_trie_branch (see the sketch below). Another problem is that the size
limitation in bpf_mem_alloc() is 4096. That may be a little small for
the total size of key plus value, but maybe I can use two separate
bpf_mem_allocs for key and value.
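To make the reuse problem concrete, here is a minimal sketch of the
hazard. The struct layout, the field names and compute_twig_index() are
illustrative assumptions for this sketch, not the actual definitions in
the patch:

  #include <linux/rcupdate.h>
  #include <linux/types.h>

  /* Illustrative layout only: a branch with a variable number of children */
  struct qp_trie_branch {
          struct rcu_head rcu;      /* deferred free of the branch node */
          unsigned int bitmap;      /* which twigs (children) are present */
          void __rcu *nodes[];      /* pointers to child branches/leaves */
  };

  /*
   * A reader walking the trie under rcu_read_lock() roughly does:
   *
   *     index = compute_twig_index(branch->bitmap, key);   (A)
   *     child = rcu_dereference(branch->nodes[index]);     (B)
   *
   * With call_rcu()-based freeing, a branch unlinked between (A) and (B)
   * stays intact until the grace period ends, so (B) still loads a sane
   * pointer. If the allocator hands the freed branch to a concurrent
   * update immediately, bitmap and nodes[] may already describe a
   * different subtree by the time (B) runs: index can exceed the new
   * child count and the loaded pointer can be garbage, hence the oops.
   */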
Strings in BTF string sections (entries=115980, max size=71, threads=4)

On hash-table:
Iter 0 ( 21.872us): hits 11.496M/s ( 2.874M/prod), drops 0.000M/s, total operations 11.496M/s
Iter 1 (  4.816us): hits 12.701M/s ( 3.175M/prod), drops 0.000M/s, total operations 12.701M/s
Iter 2 (-11.006us): hits 12.443M/s ( 3.111M/prod), drops 0.000M/s, total operations 12.443M/s
Iter 3 (  2.628us): hits 11.223M/s ( 2.806M/prod), drops 0.000M/s, total operations 11.223M/s
Slab: 22.225 MiB

On qp-trie:
Iter 0 ( 24.388us): hits  4.062M/s ( 1.016M/prod), drops 0.000M/s, total operations  4.062M/s
Iter 1 ( 20.612us): hits  3.884M/s ( 0.971M/prod), drops 0.000M/s, total operations  3.884M/s
Iter 2 (-21.533us): hits  4.046M/s ( 1.012M/prod), drops 0.000M/s, total operations  4.046M/s
Iter 3 (  2.090us): hits  3.971M/s ( 0.993M/prod), drops 0.000M/s, total operations  3.971M/s
Slab: 10.849 MiB

(3) Improve memory efficiency further

When using the strings in the BTF string section as the data set for
qp-trie, the slab memory usage shown in the cgroup memory.stat file is
about 11MB for qp-trie and 22MB for the hash table, as shown above.
However, the theoretical memory usage for qp-trie is ~6.8MB (~4.9MB if
the "parent" & "rcu" fields were removed from qp_trie_branch), and the
extra memory usage (about 38% of the total) mainly comes from internal
fragmentation in slab (namely the 2^n rounding of allocations) and the
overhead of kmem-cgroup accounting. We could reduce the internal
fragmentation by creating separate kmem_caches for qp_trie_branch nodes
with different numbers of children (a rough sketch is included before
the change log below), but I am not sure whether it is worth it.

And in order to avoid allocating an rcu_head for each leaf node, now
only branch nodes are RCU-freed, so when replacing a leaf node, a new
branch node and a new leaf node are allocated instead of replacing the
old leaf node and RCU-freeing it. I am not sure whether there are ways
to remove rcu_head completely from the qp_trie_branch node. Headless
kfree_rcu() doesn't need an rcu_head, but it may sleep. It seems the
BPF memory allocator is a better choice for now because it replaces
rcu_head by llist_node.

Sorted strings in BTF string sections (entries=115980)
htab lookup      (1 thread)    6.915 ± 0.029M/s (drops 0.000 ± 0.000M/s mem  22.216 MiB)
qp-trie lookup   (1 thread)    6.791 ± 0.005M/s (drops 0.000 ± 0.000M/s mem  11.273 MiB)

All files under linux kernel source directory (entries=74359)
htab lookup      (1 thread)    7.978 ± 0.009M/s (drops 0.000 ± 0.000M/s mem  14.272 MiB)
qp-trie lookup   (1 thread)    5.521 ± 0.003M/s (drops 0.000 ± 0.000M/s mem   9.367 MiB)

Domain names for Alexa top million web sites (entries=1000000)
htab lookup      (1 thread)    3.148 ± 0.039M/s (drops 0.000 ± 0.000M/s mem 190.831 MiB)
qp-trie lookup   (1 thread)    2.374 ± 0.026M/s (drops 0.000 ± 0.000M/s mem  83.733 MiB)

Comments and suggestions are always welcome.

[0]: https://lore.kernel.org/bpf/CAEf4Bzb7keBS8vXgV5JZzwgNGgMV0X3_guQ_m9JW3X6fJBDpPQ@xxxxxxxxxxxxxx/
[1]: https://lore.kernel.org/bpf/db34696a-cbfe-16e8-6dd5-8174b97dcf1d@xxxxxxxxxx/
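Before the change log, here is a rough sketch of the separate-kmem_cache
idea from (3). It reuses the illustrative qp_trie_branch layout from the
earlier sketch, and the maximum fan-out of 17 is only a placeholder for
a nibble-based qp-trie, not a value taken from the patch:

  #include <linux/kernel.h>
  #include <linux/slab.h>

  #define QP_TRIE_MAX_FANOUT 17   /* placeholder: nibble-based qp-trie */

  static struct kmem_cache *qp_trie_branch_caches[QP_TRIE_MAX_FANOUT + 1];

  static int qp_trie_create_branch_caches(void)
  {
          char name[24];
          int i;

          for (i = 1; i <= QP_TRIE_MAX_FANOUT; i++) {
                  /* header plus exactly i child slots, no 2^n rounding */
                  size_t size = sizeof(struct qp_trie_branch) +
                                i * sizeof(void *);

                  snprintf(name, sizeof(name), "qp_trie_branch-%d", i);
                  /* SLAB_ACCOUNT keeps the kmem-cgroup accounting */
                  qp_trie_branch_caches[i] =
                          kmem_cache_create(name, size, 0, SLAB_ACCOUNT, NULL);
                  if (!qp_trie_branch_caches[i])
                          goto destroy;
          }
          return 0;

  destroy:
          while (--i >= 1)
                  kmem_cache_destroy(qp_trie_branch_caches[i]);
          return -ENOMEM;
  }

An update allocating a branch with n children would then pick
qp_trie_branch_caches[n] instead of kmalloc(), so e.g. a 3-child branch
no longer pays for a power-of-two-sized object.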
Change Log:

v2:
  * Always use copy_to_user() in bpf_copy_to_dynptr_ukey()
    (from kernel test robot <lkp@xxxxxxxxx>; a no-MMU ARM32 host doesn't
    support 8-byte get_user())
  * Remove BPF_F_DYNPTR_KEY and use the more extensible dynptr_key_off
  * Remove the unnecessary rcu_barrier in qp_trie_free()
  * Add a new helper bpf_dynptr_user_trim() for bpftool in libbpf
  * Support qp-trie map in bpftool
  * Add 2 more test_progs test cases for qp-trie map
  * Fix test_progs-no_alu32 failure for test_progs test case
  * Add tests for not-supported operations in test_maps

v1: https://lore.kernel.org/bpf/20220917153125.2001645-1-houtao@xxxxxxxxxxxxxxx/
  * Use bpf_dynptr as the map key type instead of a bpf_lpm_trie_key-styled
    key (suggested by Andrii)
  * Fix build error and RCU-related sparse errors reported by the lkp robot
  * Copy the passed key first in qp_trie_update_elem(), because the content
    of the passed key may change and break the assumption in the two-round
    lookup process during update
  * Add the missing rcu_barrier in qp_trie_free()

RFC: https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@xxxxxxxxxx/

Hou Tao (13):
  bpf: Export bpf_dynptr_set_size()
  bpf: Add helper btf_find_dynptr()
  bpf: Support bpf_dynptr-typed map key in bpf syscall
  bpf: Support bpf_dynptr-typed map key in verifier
  libbpf: Add helpers for bpf_dynptr_user
  bpf: Add support for qp-trie map with dynptr key
  libbpf: Add probe support for BPF_MAP_TYPE_QP_TRIE
  bpftool: Add support for qp-trie map
  selftests/bpf: Add two new dynptr_fail cases for map key
  selftests/bpf: Move ENOTSUPP into bpf_util.h
  selftests/bpf: Add prog tests for qp-trie map
  selftests/bpf: Add benchmark for qp-trie map
  selftests/bpf: Add map tests for qp-trie by using bpf syscall

 include/linux/bpf.h                             |   11 +-
 include/linux/bpf_types.h                       |    1 +
 include/linux/btf.h                             |    1 +
 include/uapi/linux/bpf.h                        |    7 +
 kernel/bpf/Makefile                             |    1 +
 kernel/bpf/bpf_qp_trie.c                        | 1057 ++++++++++++++
 kernel/bpf/btf.c                                |   13 +
 kernel/bpf/helpers.c                            |    9 +-
 kernel/bpf/map_in_map.c                         |    3 +
 kernel/bpf/syscall.c                            |  121 +-
 kernel/bpf/verifier.c                           |   17 +-
 .../bpf/bpftool/Documentation/bpftool-map.rst   |    4 +-
 tools/bpf/bpftool/btf_dumper.c                  |   33 +
 tools/bpf/bpftool/map.c                         |  149 +-
 tools/include/uapi/linux/bpf.h                  |    7 +
 tools/lib/bpf/bpf.h                             |   29 +
 tools/lib/bpf/libbpf.c                          |    1 +
 tools/lib/bpf/libbpf_probes.c                   |   25 +
 tools/testing/selftests/bpf/Makefile            |    5 +-
 tools/testing/selftests/bpf/bench.c             |   10 +
 .../selftests/bpf/benchs/bench_qp_trie.c        |  511 +++++++
 .../selftests/bpf/benchs/run_bench_qp_trie.sh   |   55 +
 tools/testing/selftests/bpf/bpf_util.h          |    4 +
 .../selftests/bpf/map_tests/qp_trie_map.c       | 1209 +++++++++++++++++
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c       |    4 -
 tools/testing/selftests/bpf/prog_tests/dynptr.c |    2 +
 .../selftests/bpf/prog_tests/lsm_cgroup.c       |    4 -
 .../selftests/bpf/prog_tests/qp_trie_test.c     |  214 +++
 tools/testing/selftests/bpf/progs/dynptr_fail.c |   43 +
 .../selftests/bpf/progs/qp_trie_bench.c         |  236 ++++
 .../selftests/bpf/progs/qp_trie_test.c          |  200 +++
 tools/testing/selftests/bpf/test_maps.c         |    4 -
 32 files changed, 3931 insertions(+), 59 deletions(-)
 create mode 100644 kernel/bpf/bpf_qp_trie.c
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_qp_trie.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_qp_trie.sh
 create mode 100644 tools/testing/selftests/bpf/map_tests/qp_trie_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/qp_trie_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/qp_trie_test.c

-- 
2.29.2