Hi,

On 10/28/2022 2:52 AM, Andrii Nakryiko wrote:
> On Wed, Oct 19, 2022 at 10:01 AM Tony Finch <dot@xxxxxxxx> wrote:
>> Hello all,
>>
>> I have just found out about this qp-trie work, and I'm pleased to hear
>> that it is looking promising for you!
>>
> This is a very nice data structure, so thank you for doing a great job
> explaining it in your post!

Sorry for the late reply. Still digging into other problems. Also
thanks to Tony for his work on qp.git.

>
>> I have a few very broad observations:
>>
>> The "q" in qp-trie doesn't have to stand for "quadbit". There's a tradeoff
>> between branch factor, maximum key length, and size of branch node. The
>> greater the branch factor, the fewer indirections needed to traverse the
>> tree; but if you go too wide then prefetching is less effective and branch
>> nodes get bigger. I found that 5 bits was the sweet spot (32 wide bitmap,
>> 30ish bit key length) - indexing 5 bit mouthfuls out of the key is HORRID
>> but it was measurably faster than 4 bits. 6 bits (64 bits of bitmap) grew
>> nodes from 16 bytes to 24 bytes, and it ended up slower.
>>
>> Your interior nodes are much bigger than mine, so you might find the
>> tradeoff is different. I encourage you to try it out.

The parent field in qp_trie_branch is used to support non-recursive
iteration, and rcu_head is used for RCU memory freeing, which is why
our interior nodes are bigger.

> True, but I think for (at least initial) simplicity, sticking to
> half-bytes would simplify the code and let us figure out BPF and
> kernel-specific issues without having to worry about the correctness
> of the qp-trie core logic itself.

Agreed.

>
>> I saw there has been some discussion about locking and RCU. My current
>> project is integrating a qp-trie into BIND, with the aim of replacing its
>> old red-black tree for searching DNS records. It's based on a concurrent
>> qp-trie that I prototyped in NSD (a smaller and simpler DNS server than
>> BIND). My strategy is based on a custom allocator for interior nodes. This
>> has two main effects:
>>
>> * Node references are now 32 bit indexes into the allocator's pool,
>>   instead of 64 bit pointers; nodes are 12 bytes instead of 16 bytes.
>>
>> * The allocator supports copy-on-write and safe memory reclamation with
>>   a fairly small overhead, 3 x 32 bit counters per memory chunk (each
>>   chunk is roughly page sized).
>>
>> I wrote some notes when the design was new, but things have changed since
>> then.
>>
>> https://dotat.at/@/2021-06-23-page-based-gc-for-qp-trie-rcu.html
>>
>> For memory reclamation the interior nodes get moved / compacted. It's a
>> kind of garbage collector, but easy-mode because the per-chunk counters
>> accurately indicate when compaction is worthwhile. I've written some notes
>> on my several failed GC experiments; the last / current attempt seems (by
>> and large) good enough.
>>
>> https://dotat.at/@/2022-06-22-compact-qp.html
>>
>> For exterior / leaf nodes, I'm using atomic refcounts to know when they
>> can be reclaimed. The caller is responsible for COWing its leaves when
>> necessary.
>>
>> Updates to the tree are transactional in style, and do not block readers:
>> a single writer gets the write mutex, makes whatever changes it needs
>> (copying as required), then commits by flipping the tree's root. After a
>> commit it can free unused chunks. (Compaction can be part of an update
>> transaction or a transaction of its own.)
>>
>> I'm currently using a reader-writer lock for the tree root, but I designed
>> it with liburcu in mind, while trying to keep things simple.
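Just to check my understanding of the transactional scheme above, here
is a rough sketch of the root-flip commit using liburcu. Everything in
it (the struct layout and the helpers do_lookup(), cow_insert(),
free_unused_chunks()) is made up for illustration and is not taken from
BIND or NSD:

#include <errno.h>
#include <pthread.h>
#include <urcu.h>	/* liburcu: rcu_read_lock(), synchronize_rcu(), ... */

struct qp_node;		/* interior or leaf node, details omitted */

struct qp_trie {
	struct qp_node *root;		/* RCU-protected root pointer */
	pthread_mutex_t write_lock;	/* one writer at a time */
};

/* Hypothetical helpers, bodies omitted. */
struct qp_node *do_lookup(struct qp_node *root, const void *key, size_t len);
struct qp_node *cow_insert(struct qp_node *root, const void *key, size_t len,
			   void *value);
void free_unused_chunks(struct qp_trie *t);

/* Readers never block: they walk whichever root snapshot they see. */
struct qp_node *qp_lookup(struct qp_trie *t, const void *key, size_t len)
{
	struct qp_node *leaf;

	rcu_read_lock();
	leaf = do_lookup(rcu_dereference(t->root), key, len);
	rcu_read_unlock();
	return leaf;
}

/* A single writer COWs what it needs, then commits by flipping the root. */
int qp_insert(struct qp_trie *t, const void *key, size_t len, void *value)
{
	struct qp_node *new_root;

	pthread_mutex_lock(&t->write_lock);
	new_root = cow_insert(t->root, key, len, value); /* copies the path */
	if (!new_root) {
		pthread_mutex_unlock(&t->write_lock);
		return -ENOMEM;
	}
	rcu_assign_pointer(t->root, new_root);	/* the commit */

	/* Wait for readers of the old root, then reclaim replaced nodes. */
	synchronize_rcu();
	free_unused_chunks(t);
	pthread_mutex_unlock(&t->write_lock);
	return 0;
}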
>>
>> This strategy is very heavily biased in favour of readers, which suits DNS
>> servers. I don't know enough about BPF to have any idea what kind of
>> update traffic you need to support.

> These are some nice ideas, I did a quick read on your latest blog
> posts, missed those updates since last time I checked your blog.
>
> One limitation that we have in the BPF world is that BPF programs can
> be run in extremely restrictive contexts (e.g., NMI), in which things
> that user-space can assume will almost always succeed (like memory
> allocation), are not allowed. We do have BPF-specific memory
> allocator, but even it can fail to allocate memory, depending on
> allocation patterns. So we need to think if this COW approach is
> acceptable. I'd love for Hou Tao to think about this and chime in,
> though, as he spent a lot of time thinking about particulars.

The current implementation of BPF_MAP_TYPE_QP_TRIE is already COWed,
but it only copies one interior node per update. When adding or
deleting a leaf node, its parent interior node is copied into a new
interior node, the pointer to the old parent node (in the grand-parent
interior node) is updated to point at the new parent node, and the old
parent node is RCU-freed. According to the description above, COW in
the BIND qp-trie means that all nodes on the path from the root node
down to the leaf node are COWed, so I think the current single-node COW
implementation is a better fit for the bpf map usage scenario. But I
will check the qp-trie code in BIND [0] later. (A simplified sketch of
the single-node COW is at the end of this mail.)

0: https://gitlab.isc.org/isc-projects/bind9/-/commit/ecc555e6ec763c4f8f2495864ec08749202fff1a#65b4d67ce64e9195e41ac43d78af5156f9ebb779_0_553

> But very basically, ultimate memory and performance savings are
> perhaps less important in trying to fit qp-trie into BPF framework. We
> can iterate after with optimizations and improvements, but first we
> need to get the things correct and well-behaved.

Understood.

>
>> At the moment I am reworking and simplifying my transaction and
>> reclamation code and it's all very broken. I guess this isn't the best
>> possible time to compare notes on qp-trie variants, but I'm happy to hear
>> from others who have code and ideas to share.

> It would be great if you can lend your expertise in reviewing at least
> generic qp-trie parts, but also in helping to figure out the overall
> concurrency approach we can take in kernel/BPF land (depending on your
> familiarity with kernel specifics, of course).
>
> Thanks for offering the latest on qp-trie, exciting to see more
> production applications of qp-trie and that you are still actively
> working on this!

Yes, it would be great if Tony could help to review or co-design the
bpf qp-trie map.

>> --
>> Tony Finch <dot@xxxxxxxx> https://dotat.at/
>> Mull of Kintyre to Ardnamurchan Point: East or southeast 4 to 6,
>> increasing 6 to gale 8 for a time. Smooth or slight in eastern
>> shelter, otherwise slight or moderate. Rain or showers. Good,
>> occasionally poor.
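P.S. The promised sketch of the single-node COW described above, much
simplified. The layout and helpers are illustrative only and do not
match the actual BPF_MAP_TYPE_QP_TRIE patches (which, for one thing,
allocate from the BPF memory allocator rather than with kzalloc()):

#include <linux/bitops.h>	/* hweight32() */
#include <linux/errno.h>
#include <linux/overflow.h>	/* struct_size() */
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>

struct qp_trie_branch {
	struct qp_trie_branch *parent;	/* supports non-recursive iteration */
	struct rcu_head rcu;		/* deferred free of replaced nodes */
	u32 bitmap;			/* one bit per present twig */
	void *twigs[];			/* children: branch or leaf nodes */
};

/*
 * Insert @leaf under the branch node @old, whose slot in the
 * grand-parent is @pslot.  @bit is the bitmap bit of the new twig
 * (1U << nibble).
 */
static int qp_trie_cow_add(void __rcu **pslot, struct qp_trie_branch *old,
			   void *leaf, u32 bit)
{
	unsigned int nr = hweight32(old->bitmap);
	unsigned int pos = hweight32(old->bitmap & (bit - 1));
	struct qp_trie_branch *new;

	/* Allocation may fail (e.g. in NMI context); reject the update. */
	new = kzalloc(struct_size(new, twigs, nr + 1), GFP_NOWAIT);
	if (!new)
		return -ENOMEM;

	/* Copy-on-write: the old node stays visible to concurrent readers. */
	new->parent = old->parent;
	new->bitmap = old->bitmap | bit;
	memcpy(new->twigs, old->twigs, pos * sizeof(void *));
	new->twigs[pos] = leaf;
	memcpy(&new->twigs[pos + 1], &old->twigs[pos],
	       (nr - pos) * sizeof(void *));

	/*
	 * Note: a real implementation must also re-point the parent
	 * fields of the copied children at @new; omitted here.
	 */

	/* Publish the copy in the grand-parent, then retire the old node. */
	rcu_assign_pointer(*pslot, new);
	kfree_rcu(old, rcu);
	return 0;
}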