On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@xxxxxxxx> wrote:
>
> Hello all,
>
> I have just found out about this qp-trie work, and I'm pleased to hear
> that it is looking promising for you!
>

This is a very nice data structure, so thank you for doing a great job of
explaining it in your post!

> I have a few very broad observations:
>
> The "q" in qp-trie doesn't have to stand for "quadbit". There's a
> tradeoff between branch factor, maximum key length, and size of branch
> node. The greater the branch factor, the fewer indirections needed to
> traverse the tree; but if you go too wide then prefetching is less
> effective and branch nodes get bigger. I found that 5 bits was the
> sweet spot (32 wide bitmap, 30ish bit key length) - indexing 5 bit
> mouthfuls out of the key is HORRID but it was measurably faster than
> 4 bits. 6 bits (64 bits of bitmap) grew nodes from 16 bytes to 24
> bytes, and it ended up slower.
>
> Your interior nodes are much bigger than mine, so you might find the
> tradeoff is different. I encourage you to try it out.

True, but at least initially I think sticking to half-bytes would
simplify the code and let us figure out BPF- and kernel-specific issues
without also having to worry about the correctness of the qp-trie core
logic itself.

>
> I saw there has been some discussion about locking and RCU. My current
> project is integrating a qp-trie into BIND, with the aim of replacing
> its old red-black tree for searching DNS records. It's based on a
> concurrent qp-trie that I prototyped in NSD (a smaller and simpler DNS
> server than BIND). My strategy is based on a custom allocator for
> interior nodes. This has two main effects:
>
> * Node references are now 32 bit indexes into the allocator's pool,
>   instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>
> * The allocator supports copy-on-write and safe memory reclamation
>   with a fairly small overhead, 3 x 32 bit counters per memory chunk
>   (each chunk is roughly page sized).
>
> I wrote some notes when the design was new, but things have changed
> since then.
>
> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>
> For memory reclamation the interior nodes get moved / compacted. It's
> a kind of garbage collector, but easy-mode because the per-chunk
> counters accurately indicate when compaction is worthwhile. I've
> written some notes on my several failed GC experiments; the last /
> current attempt seems (by and large) good enough.
>
> https://dotat.at/@/2022-06-22-compact-qp.html
>
> For exterior / leaf nodes, I'm using atomic refcounts to know when
> they can be reclaimed. The caller is responsible for COWing its
> leaves when necessary.
>
> Updates to the tree are transactional in style, and do not block
> readers: a single writer gets the write mutex, makes whatever changes
> it needs (copying as required), then commits by flipping the tree's
> root. After a commit it can free unused chunks. (Compaction can be
> part of an update transaction or a transaction of its own.)
>
> I'm currently using a reader-writer lock for the tree root, but I
> designed it with liburcu in mind, while trying to keep things simple.
>
> This strategy is very heavily biased in favour of readers, which
> suits DNS servers. I don't know enough about BPF to have any idea
> what kind of update traffic you need to support.

These are some nice ideas. I did a quick read of your latest blog posts;
I had missed those updates since the last time I checked your blog. One
limitation we have in the BPF world is that BPF programs can run in
extremely restrictive contexts (e.g., NMI), in which operations that
user space can assume will almost always succeed (like memory
allocation) are not allowed.
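To make that concern concrete, here is a rough user-space sketch of how I
read your COW-and-root-flip update scheme (all names invented, not
proposed code): in the kernel the malloc() would be something like
bpf_mem_alloc() and the root flip would be rcu_assign_pointer(), and the
allocation step is exactly the part that can fail in NMI context:

```c
#include <stdlib.h>
#include <string.h>

/* Illustration only: simplified interior node and tree root. */
struct qp_node {
	unsigned bitmap;        /* which children are present */
	struct qp_node *twigs;  /* dense child array */
};

struct qp_trie {
	struct qp_node *root;   /* readers would dereference this under RCU */
};

/* COW step: every node on the modified path gets copied. In the kernel
 * the allocation can fail in restrictive contexts, so the whole update
 * must be able to unwind without blocking. */
static struct qp_node *qp_node_cow(const struct qp_node *old)
{
	struct qp_node *n = malloc(sizeof(*n));
	if (!n)
		return NULL;  /* propagate the failure instead of sleeping */
	memcpy(n, old, sizeof(*n));
	return n;
}

/* Commit: publish the new tree with a single root flip. In the kernel
 * this would be rcu_assign_pointer(t->root, new_root), with old nodes
 * freed only after a grace period; readers see either the old or the
 * new tree, never a mix of both. */
static void qp_trie_commit(struct qp_trie *t, struct qp_node *new_root)
{
	t->root = new_root;
}
```

The nice property (as I understand your design) is that readers are never
blocked; the open question for us is what to do when qp_node_cow() fails
partway through a path copy.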
We do have a BPF-specific memory allocator, but even it can fail to
allocate memory, depending on allocation patterns, so we need to think
about whether this COW approach is acceptable. I'd love for Hou Tao to
think about this and chime in, though, as he has spent a lot of time on
the particulars. But very basically, squeezing out the last bit of
memory and performance savings is perhaps less important than fitting
qp-trie into the BPF framework in the first place. We can iterate with
optimizations and improvements afterwards, but first we need to get
things correct and well-behaved.

>
> At the moment I am reworking and simplifying my transaction and
> reclamation code and it's all very broken. I guess this isn't the best
> possible time to compare notes on qp-trie variants, but I'm happy to
> hear from others who have code and ideas to share.

It would be great if you could lend your expertise to reviewing at least
the generic qp-trie parts, and also to help figure out the overall
concurrency approach we can take in kernel/BPF land (depending on your
familiarity with kernel specifics, of course). Thanks for sharing the
latest on qp-trie; it's exciting to see more production applications of
qp-trie and that you are still actively working on this!

>
> --
> Tony Finch <dot@xxxxxxxx> https://dotat.at/
> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
> increasing 6 to gale 8 for a time. Smooth or slight in eastern
> shelter, otherwise slight or moderate. Rain or showers. Good,
> occasionally poor.
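P.S. For concreteness, by "generic qp-trie parts" I mean the half-byte
indexing core, roughly along these lines (names invented, just a sketch
of the textbook scheme, not actual proposed kernel code): a branch node
carries a 16-bit bitmap of which nibble values have children, and a
child's slot in the dense twig array is the popcount of the bitmap bits
below its nibble.

```c
#include <stdint.h>

/* Extract the i-th 4-bit nibble of a key, high half-byte first so that
 * trie order matches lexicographic byte order. */
static unsigned key_nibble(const uint8_t *key, unsigned i)
{
	uint8_t b = key[i / 2];

	return (i & 1) ? (b & 0x0f) : (b >> 4);
}

/* Map a nibble to its slot in the dense child array of a branch whose
 * 16-bit bitmap marks which nibble values are present. Returns -1 when
 * the branch has no child for this nibble. */
static int twig_slot(uint16_t bitmap, unsigned nibble)
{
	uint16_t bit = (uint16_t)1 << nibble;

	if (!(bitmap & bit))
		return -1;
	/* number of set bits below ours == index into the twig array */
	return __builtin_popcount(bitmap & (bit - 1));
}
```

This is the part that stays identical whatever we decide about locking,
allocators, and node layout, which is why I think it's the right place
to start the review.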