On Wed, Apr 13, 2022 at 6:03 PM Hou Tao <houtao1@xxxxxxxxxx> wrote:
>
> Hi,
>
> (I sent my previous reply in HTML mode by mistake and the mailing list
> didn't receive it, so I'm sending it again in plain text mode.)
>
> On 4/13/2022 12:09 PM, Andrii Nakryiko wrote:
> > On Fri, Apr 8, 2022 at 8:08 PM Hou Tao <houtao1@xxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 4/7/2022 1:38 AM, Andrii Nakryiko wrote:
> >>> On Thu, Mar 31, 2022 at 5:04 AM Hou Tao <houtao1@xxxxxxxxxx> wrote:
> >>>> Hi,
> >>>>
> >>>> The initial motivation for the patchset comes from a suggestion by
> >>>> Alexei. During the discussion about supporting string keys in the
> >>>> hash table, he saw the space efficiency of the ternary search tree
> >>>> in our early tests and suggested posting it as a new bpf map [1].
> >>>>
> >>>> A ternary search tree is a special trie whose nodes are arranged in
> >>>> a manner similar to a binary search tree, but with up to three
> >>>> children rather than two. The three children correspond to nodes
> >>>> whose values are less than, equal to, and greater than the value of
> >>>> the current node respectively.
> >>>>
> >>>> In the ternary search tree map, only the valid content of the string
> >>>> is saved; the trailing null byte and unused bytes after it are not.
> >>>> If there are common prefixes between the strings, the prefix is only
> >>>> saved once. Compared with other space-optimized tries (e.g. HAT-trie,
> >>>> succinct trie), the advantages of the ternary search tree are its
> >>>> simplicity and writability.
> snip
> >>>>
> >>> Have you heard of and tried qp-trie ([0]) by any chance? It is an
> >>> elegant and simple data structure. By all the available benchmarks it
> >>> handily beats Red-Black trees in terms of memory usage and performance
> >>> (though it of course depends on the data set, just like the "memory
> >>> compression" of your ternary tree depends on a large set of common
> >>> prefixes). A qp-trie based BPF map seems (at least on paper) like a
> >>> better general-purpose BPF map that is dynamically sized (avoiding
> >>> current HASHMAP limitations) and stores keys in sorted order (and thus
> >>> allows meaningful ordered iteration *and*, importantly for a longest
> >>> prefix match tree, allows efficient prefix matches). I did a quick
> >>> experiment about a month ago trying to replace libbpf's internal use
> >>> of hashmap with qp-trie for BTF string dedup and it was slightly
> >>> slower than hashmap (not surprisingly, though, because libbpf
> >>> over-sizes hashmap to avoid hash collisions and long chains in
> >>> buckets), but it was still very decent even in that scenario. So I've
> >>> been mulling the idea of implementing a BPF map based on qp-trie's
> >>> elegant design and ideas, but can't find time to do this.
> >> I heard about it when checking the space efficiency of HAT-trie [0].
> >> Because qp-trie needs to save the whole string key in the leaf node,
> >> its space efficiency cannot be better than the ternary search tree for
> >> strings with common prefixes, so I did not consider it. But I will do
> >> some benchmarks to compare the lookup performance and space efficiency
> >> of qp-trie and tst for strings with and without much common prefix.
> >> If qp-trie is better, I can take the time to post it as a bpf map if
> >> you are OK with that.
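(For readers following along: below is a rough user-space sketch of the
classic ternary search tree layout being discussed, just to illustrate
the three-way branching and the shared-prefix path. The names tst_node
and tst_lookup are made up here; this is not the code from the patchset.)

#include <stdbool.h>
#include <stddef.h>

struct tst_node {
	unsigned char c;          /* one byte of the key */
	struct tst_node *lt;      /* keys whose next byte sorts lower */
	struct tst_node *eq;      /* keys matching this byte; the shared-prefix spine */
	struct tst_node *gt;      /* keys whose next byte sorts higher */
	bool is_end;              /* a stored key terminates at this node */
};

static bool tst_lookup(const struct tst_node *n, const unsigned char *key,
		       size_t len)
{
	size_t i = 0;

	while (n && i < len) {
		if (key[i] < n->c) {
			n = n->lt;
		} else if (key[i] > n->c) {
			n = n->gt;
		} else {
			i++;                    /* byte matched, consume it */
			if (i == len)
				return n->is_end;
			n = n->eq;              /* follow the shared-prefix spine */
		}
	}
	return false;
}

The memory saving comes from the eq spine: keys that share a prefix reuse
the same chain of nodes and only diverge through lt/gt where their bytes
differ.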
> > You can probably always craft a data set where prefix sharing is so
> > prevalent that the space savings are very significant. But I think for
> > a lot of real-world data it won't be as extreme and qp-trie might be
> > very comparable (if not more memory-efficient) due to its very compact
> > node layout (which was the point of qp-trie). So I'd be really curious
> > to see some comparisons. Would be great if you can try both!
> It is a bit surprising to me that qp-trie has better memory efficiency
> (and sometimes better lookup performance) than tst when there are not
> many common prefixes between the input strings. (All tests below were
> conducted by implementing the data structures in user space.)

Thanks for the quick follow up and the benchmark! The low memory use is
probably due to the minimal amount of pointers and extra metadata used
per node in qp-trie. The qp-trie approach is very lean, which is why I
was recommending trying it out.

>
> * all unique symbols in /proc/kallsyms (171428 sorted symbols, 4.2MB in total)
>
>                         | qp-trie | tst   | hash  |
> total memory used (MB)  | 8.6     | 11.2  | 22.3  |
> total update time (us)  | 94623   | 87396 | 24477 |
> total lookup time (us)  | 50681   | 67395 | 22842 |
>
> * all strings in the BTF string area (115980 unsorted strings, 2MB in total)
>
>                         | qp-trie | tst   | hash  |
> total memory used (MB)  | 5.0     | 7.3   | 13.5  |
> total update time (us)  | 67764   | 43484 | 16462 |
> total lookup time (us)  | 33732   | 31612 | 16462 |
>
> * all strings in the BTF string area (115980 sorted strings, 2MB in total)
>
>                         | qp-trie | tst   | hash  |
> total memory used (MB)  | 5.0     | 7.3   | 13.5  |
> total update time (us)  | 58745   | 57756 | 16210 |
> total lookup time (us)  | 26922   | 40850 | 16896 |
>
> * all files under the Linux kernel tree (2.7MB, 74359 files generated by
>   the find utility with "./" stripped)
>
>                         | qp-trie | tst   | hash  |
> total memory used (MB)  | 4.6     | 5.2   | 11.6  |
> total update time (us)  | 50422   | 28842 | 15255 |
> total lookup time (us)  | 22543   | 18252 | 11836 |

Seems like lookup time is more or less on par (and for kallsyms
noticeably faster), but update is sometimes a bit slower. I don't know
whether you wrote your own code or used an open-source implementation,
but keep in mind that the performance of qp-trie very much depends on a
fast __builtin_popcount, so make sure you are using the proper -march
when compiling. See [0].

[0] https://stackoverflow.com/questions/52161596/why-is-builtin-popcount-slower-than-my-own-bit-counting-function
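(To show why popcount sits on the hot path, here is a simplified
illustration of the indexing idea: a branch keeps a bitmap of which
nibble values have a child and stores only the present children in a
dense array. The names struct branch and child_for_nibble are made up;
the real qp-trie node layout differs, so treat this only as a sketch.)

#include <stdint.h>
#include <stddef.h>

struct branch {
	uint16_t bitmap;          /* bit n set => a child exists for nibble value n */
	struct branch **twigs;    /* dense array holding only the existing children */
};

static struct branch *child_for_nibble(const struct branch *b, unsigned int nibble)
{
	uint16_t bit = 1u << nibble;

	if (!(b->bitmap & bit))
		return NULL;                    /* no child for this nibble */
	/* index = number of present children that sort before this nibble */
	return b->twigs[__builtin_popcount(b->bitmap & (bit - 1))];
}

With a target that has a hardware popcount instruction (e.g. a suitable
-march on x86-64), __builtin_popcount typically compiles to a single
instruction; without it the compiler falls back to a library helper,
which is the slowdown discussed in [0].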
>
> When the length of the common prefix increases, the ternary search tree
> becomes better than qp-trie.
>
> * all files under the Linux kernel tree with a common prefix (e.g.
>   "/home/houtao")
>
>                         | qp-trie | tst   | hash  |
> total memory used (MB)  | 5.5     | 5.2   | 12.2  |
> total update time (us)  | 51558   | 29835 | 15345 |
> total lookup time (us)  | 23121   | 19638 | 11540 |
>
> Because such lengthy prefixes are not that common, and for a string map
> I think memory efficiency and lookup performance are more important
> than update performance, maybe qp-trie is a better choice for the
> string map? Any suggestions?
>

I'm biased :) But I like the idea of qp-trie as a general-purpose
ordered and dynamically sized BPF map. It makes no assumption about the
data being string-like and sharing common prefixes. It can be made to
work just as well with any array of bytes, making it very suitable as a
generic lookup table map. Note that the upstream implementation does
assume zero-terminated strings and no key being a prefix of another
key, but all that can be removed. For fixed-length keys this can never
happen by construction; for variable-length keys (and we'll finally be
able to support those with bpf_dynptr's help very soon), we can record
the length of the key in each leaf and use it during comparisons.

Also note that qp-trie could be used internally by BPF_MAP_TYPE_LPM_TRIE
very efficiently, speeding it up considerably in the process (and
especially getting rid of the global lock).

So if you were to invest in a proper full-featured production
implementation of a BPF map, I'd start with qp-trie. From the available
benchmarks it's both faster and more memory-efficient than Red-Black
trees, which would be an alternative underlying implementation of such
an ordered and "resizable" map.

> Regards,
> Tao
> >>
> >>> This prefix sharing is nice when you have a lot of long common
> >>> prefixes, but I'm a bit skeptical that as a general-purpose BPF data
> >>> structure it's going to be that beneficial. 192 bytes of common
> >>> prefixes seems like a very unusual data set :)
> >> Yes. The case with common prefixes I know of is full file paths.
> >>> More specifically about the TST implementation in your patches: one
> >>> global per-map lock is, I think, a very big downside. We have the LPM
> >>> trie which is very slow in big part due to its global lock. It might
> >>> be possible to design a more granular locking scheme for TST, but
> >>> this whole in-place splitting logic makes that harder. I think
> >>> qp-trie can be locked in a granular fashion much more easily by
> >>> having "hand over hand" locking: lock parent, find child, lock child,
> >>> unlock parent, move into the child node. Something like that would be
> >>> more scalable overall, especially if the access pattern is not
> >>> focused on a narrow set of nodes.
> >> Yes. The global lock is a problem, but the splitting is not in-place.
> >> I will try to figure out whether the locking can be made more scalable
> >> after the benchmark comparison between qp-trie and tst.
> > Great, looking forward!
> >
> >> Regards,
> >> Tao
> >>
> >> [0]: https://github.com/Tessil/hat-trie
> >>> Anyways, I love data structures and this one is an interesting idea.
> >>> But those are just my few cents on "production-readiness" for
> >>> general-purpose data structures for BPF.
> >>>
> >>> [0] https://dotat.at/prog/qp/README.html
> >>>
> >>>> Regards,
> >>>> Tao
> >>>>
> >>>> [1]: https://lore.kernel.org/bpf/CAADnVQJUJp3YBcpESwR3Q1U6GS1mBM=Vp-qYuQX7eZOaoLjdUA@xxxxxxxxxxxxxx/
> >>>>
> >>>> Hou Tao (2):
> >>>>   bpf: Introduce ternary search tree for string key
> >>>>   selftests/bpf: add benchmark for ternary search tree map
> >>>>
> >>>>  include/linux/bpf_types.h                     |   1 +
> >>>>  include/uapi/linux/bpf.h                      |   1 +
> >>>>  kernel/bpf/Makefile                           |   1 +
> >>>>  kernel/bpf/bpf_tst.c                          | 411 +++++++++++++++++
> >>>>  tools/include/uapi/linux/bpf.h                |   1 +
> >>>>  tools/testing/selftests/bpf/Makefile          |   5 +-
> >>>>  tools/testing/selftests/bpf/bench.c           |   6 +
> >>>>  .../selftests/bpf/benchs/bench_tst_map.c      | 415 ++++++++++++++++++
> >>>>  .../selftests/bpf/benchs/run_bench_tst.sh     |  54 +++
> >>>>  tools/testing/selftests/bpf/progs/tst_bench.c |  70 +++
> >>>>  10 files changed, 964 insertions(+), 1 deletion(-)
> >>>>  create mode 100644 kernel/bpf/bpf_tst.c
> >>>>  create mode 100644 tools/testing/selftests/bpf/benchs/bench_tst_map.c
> >>>>  create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_tst.sh
> >>>>  create mode 100644 tools/testing/selftests/bpf/progs/tst_bench.c
> >>>>
> >>>> --
> >>>> 2.31.1
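P.S. To make the "hand over hand" locking idea quoted above a bit more
concrete, here is a rough user-space sketch of lock coupling on a
placeholder binary tree (the node layout and find_locked helper are
invented for illustration, not qp-trie internals): at most two locks are
held at any time, so operations in disjoint subtrees don't serialize on
a single map-wide lock the way they do with the current LPM trie.

#include <pthread.h>
#include <stddef.h>

struct node {
	pthread_mutex_t lock;
	unsigned long key;
	struct node *child[2];
};

/* Descend while locking the child before releasing the parent, so the
 * path being followed can't be re-linked or freed underneath us. */
static struct node *find_locked(struct node *root, unsigned long key)
{
	struct node *cur = root, *next;

	if (!cur)
		return NULL;
	pthread_mutex_lock(&cur->lock);
	while (cur->key != key) {
		next = cur->child[key > cur->key];
		if (!next)
			break;                          /* nearest node, still locked */
		pthread_mutex_lock(&next->lock);        /* lock child ... */
		pthread_mutex_unlock(&cur->lock);       /* ... then unlock parent */
		cur = next;
	}
	return cur;                                     /* caller unlocks */
}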