Hi,

On 10/28/2022 2:52 AM, Andrii Nakryiko wrote:
> On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@xxxxxxxx> wrote:
>> Hello all,
>>
>> I have just found out about this qp-trie work, and I'm pleased to hear
>> that it is looking promising for you!
>>
> This is a very nice data structure, so thank you for doing a great job
> explaining it in your post!

Sorry for the late reply. Still digging into other problems. Also
thanks to Tony for his work on qp.git.

>
>> I have a few very broad observations:
>>
>> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
>> between branch factor, maximum key length, and size of branch node. The
>> greater the branch factor, the fewer indirections needed to traverse the
>> tree; but if you go too wide then prefetching is less effective and branch
>> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
>> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
>> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
>> nodes from 16 bytes to 24 bytes, and it ended up slower.
>>
>> Your interior nodes are much bigger than mine, so you might find the
>> tradeoff is different. I encourage you to try it out.

The parent field in qp_trie_branch is used to support non-recursive
iteration, and rcu_head is used for RCU memory freeing, which is why
our interior nodes are bigger.

> True, but I think for (at least initial) simplicity, sticking to
> half-bytes would simplify the code and let us figure out BPF and
> kernel-specific issues without having to worry about the correctness
> of the qp-trie core logic itself.

Agreed.

>
>> I saw there has been some discussion about locking and RCU. My current
>> project is integrating a qp-trie into BIND, with the aim of replacing its
>> old red-black tree for searching DNS records. It's based on a concurrent
>> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
>> BIND). My strategy is based on a custom allocator for interior nodes. This
>> has two main effects:
>>
>> * Node references are now 32 bit indexes into the allocator's pool,
>>   instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>>
>> * The allocator supports copy-on-write and safe memory reclamation with
>>   a fairly small overhead, 3 x 32 bit counters per memory chunk (each
>>   chunk is roughly page sized).
>>
>> I wrote some notes when the design was new, but things have changed since
>> then.
>>
>> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>>
>> For memory reclamation the interior nodes get moved / compacted. It's a
>> kind of garbage collector, but easy-mode because the per-chunk counters
>> accurately indicate when compaction is worthwhile. I've written some notes
>> on my several failed GC experiments; the last / current attempt seems (by
>> and large) good enough.
>>
>> https://dotat.at/@/2022-06-22-compact-qp.html
>>
>> For exterior / leaf nodes, I'm using atomic refcounts to know when they
>> can be reclaimed. The caller is responsible for COWing its leaves when
>> necessary.
>>
>> Updates to the tree are transactional in style, and do not block readers:
>> a single writer gets the write mutex, makes whatever changes it needs
>> (copying as required), then commits by flipping the tree's root. After a
>> commit it can free unused chunks. (Compaction can be part of an update
>> transaction or a transaction of its own.)
>>
>> I'm currently using a reader-writer lock for the tree root, but I designed
>> it with liburcu in mind, while trying to keep things simple.
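Just to check my understanding of the transactional scheme above, here
is a rough sketch of the root-flip commit using liburcu. Everything in
it (the struct layout and the helpers do_lookup(), cow_insert(),
free_unused_chunks()) is made up for illustration and is not taken from
BIND or NSD:

#include <errno.h>
#include <pthread.h>
#include <urcu.h>	/* liburcu: rcu_read_lock(), synchronize_rcu(), ... */

struct qp_node;		/* interior or leaf node, details omitted */

struct qp_trie {
	struct qp_node *root;		/* RCU-protected root pointer */
	pthread_mutex_t write_lock;	/* one writer at a time */
};

/* Hypothetical helpers, bodies omitted. */
struct qp_node *do_lookup(struct qp_node *root, const void *key, size_t len);
struct qp_node *cow_insert(struct qp_node *root, const void *key, size_t len,
			   void *value);
void free_unused_chunks(struct qp_trie *t);

/* Readers never block: they walk whichever root snapshot they see. */
struct qp_node *qp_lookup(struct qp_trie *t, const void *key, size_t len)
{
	struct qp_node *leaf;

	rcu_read_lock();
	leaf = do_lookup(rcu_dereference(t->root), key, len);
	rcu_read_unlock();
	return leaf;
}

/* A single writer COWs what it needs, then commits by flipping the root. */
int qp_insert(struct qp_trie *t, const void *key, size_t len, void *value)
{
	struct qp_node *new_root;

	pthread_mutex_lock(&t->write_lock);
	new_root = cow_insert(t->root, key, len, value); /* copies the path */
	if (!new_root) {
		pthread_mutex_unlock(&t->write_lock);
		return -ENOMEM;
	}
	rcu_assign_pointer(t->root, new_root);	/* the commit */

	/* Wait for readers of the old root, then reclaim replaced nodes. */
	synchronize_rcu();
	free_unused_chunks(t);
	pthread_mutex_unlock(&t->write_lock);
	return 0;
}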
>>
>> This strategy is very heavily biased in favour of readers, which suits DNS
>> servers. I don't know enough about BPF to have any idea what kind of
>> update traffic you need to support.

> These are some nice ideas, I did a quick read on your latest blog
> posts, missed those updates since last time I checked your blog.
>
> One limitation that we have in the BPF world is that BPF programs can
> be run in extremely restrictive contexts (e.g., NMI), in which things
> that user-space can assume will almost always succeed (like memory
> allocation), are not allowed. We do have BPF-specific memory
> allocator, but even it can fail to allocate memory, depending on
> allocation patterns. So we need to think if this COW approach is
> acceptable. I'd love for Hou Tao to think about this and chime in,
> though, as he spent a lot of time thinking about particulars.

The current implementation of BPF_MAP_TYPE_QP_TRIE is already COWed,
but it only copies one interior node per update. When adding or
deleting a leaf node, its parent interior node is copied into a new
interior node, the pointer to the old parent node (in the grand-parent
interior node) is updated to point at the new parent node, and the old
parent node is RCU-freed. According to the description above, COW in
the BIND qp-trie means that all nodes on the path from the root node
down to the leaf node are COWed, so I think the current single-node COW
implementation is a better fit for the bpf map usage scenario. But I
will check the qp-trie code in BIND [0] later. (A simplified sketch of
the single-node COW is at the end of this mail.)

0: https://gitlab.isc.org/isc-projects/bind9/-/commit/ecc555e6ec763c4f8f2495864ec08749202fff1a#65b4d67ce64e9195e41ac43d78af5156f9ebb779_0_553

> But very basically, ultimate memory and performance savings are
> perhaps less important in trying to fit qp-trie into BPF framework. We
> can iterate after with optimizations and improvements, but first we
> need to get the things correct and well-behaved.

Understood.

>
>> At the moment I am reworking and simplifying my transaction and
>> reclamation code and it's all very broken. I guess this isn't the best
>> possible time to compare notes on qp-trie variants, but I'm happy to hear
>> from others who have code and ideas to share.

> It would be great if you can lend your expertise in reviewing at least
> generic qp-trie parts, but also in helping to figure out the overall
> concurrency approach we can take in kernel/BPF land (depending on your
> familiarity with kernel specifics, of course).
>
> Thanks for offering the latest on qp-trie, exciting to see more
> production applications of qp-trie and that you are still actively
> working on this!

Yes, it would be great if Tony could help to review or co-design the
bpf qp-trie map.

>> --
>> Tony Finch <dot@xxxxxxxx> https://dotat.at/
>> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
>> increasing 6 to gale 8 for a time. Smooth or slight in eastern
>> shelter, otherwise slight or moderate. Rain or showers. Good,
>> occasionally poor.
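P.S. The promised sketch of the single-node COW described above, much
simplified. The layout and helpers are illustrative only and do not
match the actual BPF_MAP_TYPE_QP_TRIE patches (which, for one thing,
allocate from the BPF memory allocator rather than with kzalloc()):

#include <linux/bitops.h>	/* hweight32() */
#include <linux/errno.h>
#include <linux/overflow.h>	/* struct_size() */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>

struct qp_trie_branch {
	struct qp_trie_branch *parent;	/* supports non-recursive iteration */
	struct rcu_head rcu;		/* deferred free of replaced nodes */
	u32 bitmap;			/* one bit per present twig */
	void *twigs[];			/* children: branch or leaf nodes */
};

/*
 * Insert @leaf under the branch node @old, whose slot in the
 * grand-parent is @pslot.  @bit is the bitmap bit of the new twig
 * (1U << nibble).
 */
static int qp_trie_cow_add(void __rcu **pslot, struct qp_trie_branch *old,
			   void *leaf, u32 bit)
{
	unsigned int nr = hweight32(old->bitmap);
	unsigned int pos = hweight32(old->bitmap & (bit - 1));
	struct qp_trie_branch *new;

	/* Allocation may fail (e.g. in NMI context); reject the update. */
	new = kzalloc(struct_size(new, twigs, nr + 1), GFP_NOWAIT);
	if (!new)
		return -ENOMEM;

	/* Copy-on-write: the old node stays visible to concurrent readers. */
	new->parent = old->parent;
	new->bitmap = old->bitmap | bit;
	memcpy(new->twigs, old->twigs, pos * sizeof(void *));
	new->twigs[pos] = leaf;
	memcpy(&new->twigs[pos + 1], &old->twigs[pos],
	       (nr - pos) * sizeof(void *));

	/*
	 * Note: a real implementation must also re-point the parent
	 * fields of the copied children at @new; omitted here.
	 */

	/* Publish the copy in the grand-parent, then retire the old node. */
	rcu_assign_pointer(*pslot, new);
	kfree_rcu(old, rcu);
	return 0;
}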