On Sat, Sep 24, 2022 at 09:36:07PM +0800, Hou Tao wrote:
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> Hi,
>
> The initial motivation for the qp-trie map is to reduce memory usage for
> string keys, especially those with large differences in length, as
> discussed in [0]. And as a big-endian, lexicographically ordered map, it
> can also be used for any binary data with fixed or variable length.
>
> Now that the basic functionality of qp-trie is ready, I am posting it to
> get more feedback or suggestions about qp-trie, especially feedback on
> the following questions:
>
> (1) Use cases for qp-trie
> Andrii had proposed to re-implement lpm-trie by using qp-trie. The
> advantage would be a speed-up of lookup operations due to the lower tree
> depth of qp-trie, and the performance of update may also increase.
> But are there any other use cases for qp-trie? Especially cases which
> need both ordering and memory efficiency, or cases in which qp-trie will
> have high fan-out and its lookup performance will be much better than a
> hash table, as shown below:
>
> Randomly-generated binary data (key size=255, max entries=16K, key length range: [1, 255])
> htab lookup    (1 thread)     4.968 ± 0.009M/s (drops 0.002 ± 0.000M/s mem 8.169 MiB)
> htab lookup    (2 threads)   10.118 ± 0.010M/s (drops 0.007 ± 0.000M/s mem 8.169 MiB)
> htab lookup    (4 threads)   20.084 ± 0.022M/s (drops 0.007 ± 0.000M/s mem 8.168 MiB)
> htab lookup    (8 threads)   39.866 ± 0.047M/s (drops 0.010 ± 0.000M/s mem 8.168 MiB)
> htab lookup    (16 threads)  79.412 ± 0.065M/s (drops 0.049 ± 0.000M/s mem 8.169 MiB)
>
> qp-trie lookup (1 thread)    10.291 ± 0.007M/s (drops 0.004 ± 0.000M/s mem 4.899 MiB)
> qp-trie lookup (2 threads)   20.797 ± 0.009M/s (drops 0.006 ± 0.000M/s mem 4.879 MiB)
> qp-trie lookup (4 threads)   41.943 ± 0.019M/s (drops 0.015 ± 0.000M/s mem 4.262 MiB)
> qp-trie lookup (8 threads)   81.985 ± 0.032M/s (drops 0.025 ± 0.000M/s mem 4.215 MiB)
> qp-trie lookup (16 threads) 164.681 ± 0.051M/s (drops 0.050 ± 0.000M/s mem 4.261 MiB)
>
> * non-zero drops
> are due to duplicated keys among the generated keys.
>
> (2) Improve update/delete performance for qp-trie
> The current top-5 overheads in update/delete operations are:
>
>   21.23%  bench  [kernel.vmlinux]  [k] qp_trie_update_elem
>   13.98%  bench  [kernel.vmlinux]  [k] qp_trie_delete_elem
>    7.96%  bench  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
>    5.16%  bench  [kernel.vmlinux]  [k] memcpy_erms
>    5.00%  bench  [kernel.vmlinux]  [k] __kmalloc_node
>
> The top-2 overheads are due to memory accesses and the atomic ops on
> max_entries. I tried memory prefetching, but it didn't work out; maybe
> I did it wrong. For the subtree spinlock overhead, I also tried
> hierarchical locking with a hand-over-hand lock scheme, but it didn't
> scale well [1]. I will try increasing the number of subtrees from 256
> to 1024, 4096 or more and check whether it makes any difference.
>
> For the atomic ops and kmalloc overhead, I think I can reuse the idea
> from the patchset "bpf: BPF specific memory allocator". I have given
> bpf_mem_alloc a simple try and encountered some problems. One problem
> is the immediate reuse of freed objects in the BPF memory allocator.
> Because qp-trie uses the BPF memory allocator to allocate and free
> qp_trie_branch objects, if a qp_trie_branch is reused immediately, the
> lookup procedure may oops due to incorrect content in the
> qp_trie_branch. Another problem is that the size limit in
> bpf_mem_alloc() is 4096. It may be a little small for the total of key
> size and value size, but maybe I can use two separate bpf_mem_allocs
> for key and value.

The 4096 limit for key+value size would be an acceptable trade-off. With
kptrs the user will be able to extend the value to much bigger sizes
while doing <= 4096 allocations at a time. Larger allocations fail in
production more often than not. Any algorithm relying on allocations of
4096 bytes or more succeeding is likely to fail. kvmalloc is a fallback
that the kernel uses, but we're not there yet in bpf land.
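The two-allocator idea could look roughly like the sketch below. This is
a purely illustrative userspace sketch, not code from the patchset: the
names `qp_trie_leaf`, `pool_alloc`, and `leaf_new` are made up here, and
`malloc` with a 4096-byte cap stands in for two separate
bpf_mem_alloc-style pools, so that neither the key nor the value
allocation ever exceeds the per-allocation limit:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-allocation cap, mirroring the bpf_mem_alloc() limit */
#define ALLOC_CAP 4096

/* A leaf that keeps key and value as two separate allocations instead
 * of one combined key+value buffer, so each piece stays under the cap.
 */
struct qp_trie_leaf {
	uint32_t key_len;
	uint32_t value_len;
	void *key;	/* from the key pool,   <= ALLOC_CAP bytes */
	void *value;	/* from the value pool, <= ALLOC_CAP bytes */
};

/* Stand-in for an allocation from a capped bpf_mem_alloc-style pool */
static void *pool_alloc(size_t size)
{
	return size <= ALLOC_CAP ? malloc(size) : NULL;
}

static struct qp_trie_leaf *leaf_new(const void *key, uint32_t key_len,
				     const void *value, uint32_t value_len)
{
	struct qp_trie_leaf *leaf = malloc(sizeof(*leaf));

	if (!leaf)
		return NULL;
	leaf->key = pool_alloc(key_len);
	leaf->value = pool_alloc(value_len);
	if (!leaf->key || !leaf->value) {
		free(leaf->key);
		free(leaf->value);
		free(leaf);
		return NULL;
	}
	memcpy(leaf->key, key, key_len);
	memcpy(leaf->value, value, value_len);
	leaf->key_len = key_len;
	leaf->value_len = value_len;
	return leaf;
}
```

With a single combined allocation, a 3000-byte key plus a 3000-byte
value would already blow past the cap; split across two pools, each
piece fits comfortably.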
The benefits of bpf_mem_alloc in qp-trie would be huge though. qp-trie
would work in all contexts, including sleepable progs. As presented, the
use cases for qp-trie are quite limited.

If I understand correctly, the concern with using bpf_mem_alloc is that
a qp_trie_branch can be reused. Can you provide an exact scenario that
will cause issues?

Instead of call_rcu in qp_trie_branch_free (which will work only for
regular progs and has high overhead, as demonstrated by the mem_alloc
patches) the qp-trie freeing logic can scrub that element, so it's ready
to be reused as another struct qp_trie_branch.

I guess I'm missing how rcu protects these internal data structures of
qp-trie. The rcu_read_lock of a regular bpf prog helps to stay lock-less
during lookup? Is that it?

So to make qp-trie work in sleepable progs the algorithm would need to
be changed to do both call_rcu and call_rcu_tasks_trace everywhere to
protect these inner structs?

call_rcu_tasks_trace can take a long time, so qp_trie_branch-s may
linger around, and quick update/delete (in sleepable progs, with
call_rcu_tasks_trace) may very well exhaust memory. With bpf_mem_alloc
we don't have this issue, since the rcu_tasks_trace grace period is
observed only when freeing into the global mem pool.

Say qp-trie just uses bpf_mem_alloc for qp_trie_branch. What is the
worst that can happen? qp_trie_lookup_elem will go down a wrong path,
but won't crash, right? Can we do the hlist_nulls trick to address that?

In other words, the bpf_mem_alloc reuse behavior is pretty much
SLAB_TYPESAFE_BY_RCU. Many kernel data structures know how to deal with
such object reuse. We can have a private bpf_mem_alloc here for
qp_trie_branch-s only and construct the logic in a way that object reuse
is not problematic.

Another alternative would be to add an explicit rcu_read_lock in
qp_trie_lookup_elem to protect qp_trie_branch during lookup while using
bpf_mem_alloc for both qp_trie_branch and leaf nodes, but that's not a
great solution either.
It would allow qp-trie to be usable in sleepable progs, but the use of
call_rcu in update/delete would prevent qp-trie from being usable in
tracing progs. Let's try to brainstorm how to make qp_trie_branch work
like SLAB_TYPESAFE_BY_RCU.

Other than this issue the patches look great. This new map would be an
awesome addition.
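As one possible starting point for that brainstorm: the common way of
tolerating SLAB_TYPESAFE_BY_RCU-style reuse (in the same spirit as the
hlist_nulls trick) is to tag each object with a generation counter that
is bumped on free, and have the lock-less reader re-validate the
generation after reading. The userspace sketch below is purely
illustrative; `node`, `node_read`, and `node_free_to_pool` are invented
names, not code from the patchset:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative sketch of tolerating immediate object reuse: the node
 * carries a generation counter that is bumped every time the node is
 * freed back to the pool.  A lock-less reader snapshots the generation
 * before reading the node's fields and re-checks it afterwards; a
 * mismatch means the node was recycled under the reader, which must
 * then retry the traversal from the root.
 */
struct node {
	_Atomic uint64_t gen;	/* bumped on every free/reuse */
	uint32_t bitmap;	/* payload the reader consumes */
};

static void node_free_to_pool(struct node *n)
{
	/* scrub the element here, then bump the generation so any
	 * concurrent reader detects the reuse
	 */
	atomic_fetch_add_explicit(&n->gen, 1, memory_order_release);
}

/* Returns 0 and copies out the payload on success, -1 if the node was
 * recycled mid-read and the caller must retry.  (A kernel version
 * would use READ_ONCE() for the payload load.)
 */
static int node_read(struct node *n, uint32_t *bitmap)
{
	uint64_t g = atomic_load_explicit(&n->gen, memory_order_acquire);

	*bitmap = n->bitmap;
	if (atomic_load_explicit(&n->gen, memory_order_acquire) != g)
		return -1;	/* reused under us: retry from the root */
	return 0;
}
```

A reader that snapshotted the generation before a concurrent free sees
the mismatch and retries instead of trusting the recycled contents,
which is exactly the property qp_trie_lookup_elem would need on top of a
private bpf_mem_alloc pool.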