On Wed, Jun 8, 2022 at 2:00 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
>
> Hi,
>
> On 4/27/2022 11:57 AM, Andrii Nakryiko wrote:
> > On Tue, Apr 26, 2022 at 1:03 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
> snip
> >>> I'm biased :) But I like the idea of qp-trie as a general-purpose
> >>> ordered and dynamically sized BPF map. It makes no assumption about
> >>> data being string-like and sharing common prefixes. It can be made to
> >>> work just as well with any array of bytes, making it very suitable as
> >>> a generic lookup table map. Note that the upstream implementation does
> >>> assume zero-terminated strings and no key being a prefix of another
> >>> key. But all that can be removed. For fixed-length keys this can never
> >>> happen by construction; for variable-length keys (and we'll be able to
> >>> support those finally with bpf_dynptr's help very soon), we can record
> >>> the length of the key in each leaf and use that during comparisons.
> >> Using a trailing zero byte to make sure no key is a prefix of another is
> >> simple, but I will check whether there is another way to make byte-array
> >> keys work. Alexei suggested using the length of the key as part of the
> >> key, just like bpf_lpm_trie_key does, so maybe I can try that first.
> > Yeah, using the key length as part of the key during comparison is what I
> > meant as well. I didn't mean to artificially add a trailing zero (that
> > idea doesn't work for arbitrary binary data).
> >
> >>> Also note that qp-trie can be used internally by BPF_MAP_TYPE_LPM_TRIE
> >>> very efficiently and speed it up considerably in the process (and
> >>> especially get rid of the global lock).
> >>>
> >>> So if you were to invest in a proper full-featured production
> >>> implementation of a BPF map, I'd start with qp-trie. From available
> >>> benchmarks it's both faster and more memory-efficient than red-black
> >>> trees, which could be an alternative underlying implementation of such
> >>> an ordered and "resizable" map.
> Recently I tried to add concurrent update and deletion support for qp-trie
> and found that a hand-over-hand locking scheme may not work for qp-trie
> updates. The short explanation is that the update procedure needs to
> traverse the qp-trie twice, and the position tuple obtained in the first
> pass may be stale due to concurrent updates occurring during the second
> pass. The detailed explanation is shown below.
>
> To reduce space usage, qp-trie does not save the common prefix in branch
> nodes, so an update first needs to traverse the qp-trie to find the most
> similar key, compare with it to get an (index, flag) tuple for the leaf
> key, and then traverse the qp-trie again to find the insert position by
> using that tuple. The problem is that the position tuple may become stale
> due to concurrent updates in a different branch. Consider the following
> case:
>
> When inserting "aa_bind_mount" and "aa_buffers_lock" concurrently into the
> qp-trie dumped below, the most similar key for "aa_bind_mount" is leaf X,
> and for "aa_buffers_lock" it is leaf Y. The calculated tuple for both new
> keys is the same: (index=3, flag=2). Assuming "aa_bind_mount" is inserted
> first, then when inserting "aa_buffers_lock" the correct position would be
> (index=4, flag=1) instead of (index=3, flag=2), and the result will be
> incorrect.
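To make the two-pass update above concrete, here is a minimal C sketch of how
the first pass could derive the (index, flag) tuple; qp_compute_pos() and the
struct are made-up illustrations, not functions from the patch set. The
qp-trie dump the example refers to is quoted right after the sketch.

/*
 * Illustrative sketch only, not the actual patch code: how pass 1 could
 * derive the (index, flag) tuple from the new key and the "most similar"
 * leaf it walked to.  The convention matches the dump quoted below:
 * index is the first differing byte, flag 1 means the high nibble of that
 * byte differs, flag 2 means the low nibble differs.  Keys are assumed
 * never to be prefixes of one another (e.g. because the key length is
 * folded into the key, as discussed above).
 */
struct qp_pos {
        unsigned int index;     /* first byte where the two keys differ */
        unsigned int flag;      /* 1: high nibble differs, 2: low nibble */
};

static struct qp_pos qp_compute_pos(const unsigned char *key,
                                    const unsigned char *closest,
                                    unsigned int len)
{
        struct qp_pos pos = { .index = len, .flag = 0 };
        unsigned int i;

        for (i = 0; i < len; i++) {
                unsigned char diff = key[i] ^ closest[i];

                if (!diff)
                        continue;
                pos.index = i;
                pos.flag = (diff & 0xf0) ? 1 : 2;
                break;
        }
        return pos;
}

/*
 * Pass 2 walks the trie again and splices a new branch at the node chosen
 * by (index, flag).  With hand-over-hand locking, another updater can
 * insert a key below an already-unlocked parent between the two passes,
 * so the tuple from pass 1 may point at the wrong position -- which is
 * exactly the "aa_bind_mount" vs. "aa_buffers_lock" race described above.
 */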
>
> branch: index 1 flags 1 bitmap 0x00088
> * leaf: a.81577 0
> * branch: index 4 flags 1 bitmap 0x00180
> * * branch: index 4 flags 2 bitmap 0x02080
> * * * leaf: aa_af_perm 1 (leaf X, for aa_bind_mount)
> * * branch: index 4 flags 2 bitmap 0x00052
> * * * leaf: aa_apply_modes_to_perms 6
> * * * leaf: aa_asprint_hashstr 7 (leaf Y, for aa_buffers_lock)
>
> In order to get a correct position tuple, the intuitive solution is to add
> the common prefix to each branch node and let the update procedure find the
> insertion position by comparing against that prefix, so it only needs to
> traverse the qp-trie once and the hand-over-hand locking scheme can work. I
> plan to work on a qp-trie with common prefixes and will report its memory
> usage and concurrent update/insert speed in this thread once the demo is
> ready. So, any more suggestions?
>

Yeah, that sucks. I'm not sure I completely understand the common prefix
solution and whether it will still be a qp-trie after that, but if you try
it, it would be interesting to learn about the results you get!

I think in the worst case we'll have to take a tree-wide lock, perhaps with
an initial 256-way root node for the first byte, so that each of the 256
subtrees can have its own lock. Alternatively, we can do an optimistic
lockless lookup (in the hope of finding a match), and if that fails, take
the tree-wide lock and perform the search and insertion again. This will
favor lookup hits, obviously, but hopefully that's going to be a common use
case where keys are mostly matching. WDYT? (A rough sketch of that
lookup-then-lock fallback is appended at the end of this mail.)

> Regards,
> Tao
>
> >> Thanks for your suggestions. I will give it a try.
> > Awesome!
> >
> >> Regards,
> >> Tao
> >>
> >>>> Regards,
> >>>> Tao
> >>>>>>> This prefix sharing is nice when you have a lot of long common
> >>>>>>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> >>>>>>> structure it's going to be that beneficial. 192 bytes of common
> >>>>>>> prefixes seems like a very unusual dataset :)
> >>>>>> Yes. The case with a common prefix that I know of is full file paths.
> >>>>>>> More specifically, about the TST implementation in your patches: one
> >>>>>>> global per-map lock is, I think, a very big downside. We have the
> >>>>>>> LPM trie, which is very slow in big part due to its global lock. It
> >>>>>>> might be possible to design a more granular scheme for TST, but this
> >>>>>>> whole in-place splitting logic makes that harder. I think qp-trie
> >>>>>>> can be locked in a granular fashion much more easily by having
> >>>>>>> "hand over hand" locking: lock parent, find child, lock child,
> >>>>>>> unlock parent, move into child node. Something like that would be
> >>>>>>> more scalable overall, especially if the access pattern is not
> >>>>>>> focused on a narrow set of nodes.
> >>>>>> Yes. The global lock is a problem, but the splitting is not in-place.
> >>>>>> I will try to figure out whether the locking can be made more
> >>>>>> scalable after the benchmark comparison between qp-trie and TST.
> >>>>> Great, looking forward!
> >>>>>
> >>>>>> Regards,
> >>>>>> Tao
> >>>>>>
> >>>>>> [0]: https://github.com/Tessil/hat-trie
> >>>>>>> Anyways, I love data structures and this one is an interesting idea.
> >>>>>>> But that's just my few cents on "production-readiness" for
> >>>>>>> general-purpose data structures for BPF.
> >>>>>>>
> >>>>>>> [0] https://dotat.at/prog/qp/README.html
> >>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Tao
> >>>>>>>>
> >>>>>>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@xxxxxxxxxxxxxx/
> >>>>>>>>
> >>>>>>>> Hou Tao (2):
> >>>>>>>>   bpf: Introduce ternary search tree for string key
> >>>>>>>>   selftests/bpf: add benchmark for ternary search tree map
> >>>>>>>>
> >>>>>>>>  include/linux/bpf_types.h                     |   1 +
> >>>>>>>>  include/uapi/linux/bpf.h                      |   1 +
> >>>>>>>>  kernel/bpf/Makefile                           |   1 +
> >>>>>>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>>>>>>>  tools/include/uapi/linux/bpf.h                |   1 +
> >>>>>>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>>>>>>>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>>>>>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>>>>>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>>>>>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>>>>>>>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>>>>>>>  create mode 100644 kernel/bpf/bpf_tst.c
> >>>>>>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>>>>>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>>>>>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> 2.31.1
> >>>>>>>>
> >>>>>>> .
> >>>>> .
> >>> .
> > .
>
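For concreteness, here is a rough sketch of the "optimistic lockless lookup,
fall back to a tree-wide lock" flow suggested earlier in this mail. All names
(qp_trie_lookup_lockless(), qp_trie_search_and_insert(), qp_leaf_set_value())
are hypothetical, not existing kernel APIs, and the RCU/visibility details of
publishing a new leaf or updating a value in place are glossed over:

#include <linux/types.h>
#include <linux/spinlock.h>

/* Hypothetical types and helpers, for illustration only. */
struct qp_trie;
struct qp_leaf;

struct qp_leaf *qp_trie_lookup_lockless(struct qp_trie *trie,
                                        const void *key, u32 key_len);
int qp_leaf_set_value(struct qp_leaf *leaf, const void *value);
int qp_trie_search_and_insert(struct qp_trie *trie, const void *key,
                              u32 key_len, const void *value);

static int qp_trie_update_elem(struct qp_trie *trie, spinlock_t *lock,
                               const void *key, u32 key_len,
                               const void *value)
{
        struct qp_leaf *leaf;
        unsigned long flags;
        int err;

        /* Fast path: lockless lookup in the hope that the key already
         * exists; then only the leaf's value needs to be updated.
         */
        leaf = qp_trie_lookup_lockless(trie, key, key_len);
        if (leaf)
                return qp_leaf_set_value(leaf, value);

        /* Slow path: the key was not found, so take the tree-wide lock and
         * redo the search before inserting, since the trie may have changed
         * in the meantime.  (The 256-way-root variant would instead take a
         * per-subtree lock selected by the first key byte here.)
         */
        spin_lock_irqsave(lock, flags);
        err = qp_trie_search_and_insert(trie, key, key_len, value);
        spin_unlock_irqrestore(lock, flags);
        return err;
}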