On Fri, May 20, 2022 at 05:45:24PM +0200, Stefano Brivio wrote: > On Tue, 17 May 2022 14:57:09 +0200 > Stefano Brivio <sbrivio@xxxxxxxxxx> wrote: > > > On Mon, 16 May 2022 20:16:53 +0200 > > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote: > > > > > Hi Stefano, > > > > > > On Thu, May 12, 2022 at 08:34:21PM +0200, Stefano Brivio wrote: > > > > In the overlap detection performed as part of the insertion operation, > > > > we have to skip nodes that are not active in the current generation or > > > > expired. This was done by adding several conditions in overlap decision > > > > clauses, which made it hard to check for correctness, and debug the > > > > actual issue this patch is fixing. > > > > > > > > Consolidate these conditions into a single clause, so that we can > > > > proceed with debugging and fixing the following issue. > > > > > > > > Case b3. described in comments covers the insertion of a start > > > > element after an existing end, as a leaf. If all the descendants of > > > > a given element are inactive, functionally, for the purposes of > > > > overlap detection, we still have to treat this as a leaf, but we don't. > > > > > > > > If, in the same transaction, we remove a start element to the right, > > > > remove an end element to the left, and add a start element to the right > > > > of an existing, active end element, we don't hit case b3. For example: > > > > > > > > - existing intervals: 1-2, 3-5, 6-7 > > > > - transaction: delete 3-5, insert 4-5 > > > > > > > > node traversal might happen as follows: > > > > - 2 (active end) > > > > - 5 (inactive end) > > > > - 3 (inactive start) > > > > > > > > when we add 4 as start element, we should simply consider 2 as > > > > preceding end, and consider it as a leaf, because interval 3-5 has been > > > > deleted. If we don't, we'll report an overlap because we forgot about > > > > this. > > > > > > > > Add an additional flag which is set as we find an active end, and reset > > > > it if we find an active start (to the left). If we finish the traversal > > > > with this flag set, it means we need to functionally consider the > > > > previous active end as a leaf, and allow insertion instead of reporting > > > > overlap. > > > > > > I can still trigger EEXIST with deletion of existing interval. It > > > became harder to reproduce after this patch. > > > > > > After hitting EEXIST, if I do: > > > > > > echo "flush ruleset" > test.nft > > > nft list ruleset >> test.nft > > > > > > to dump the existing ruleset, then I run the delete element command > > > again to remove the interval and it works. Before this patch I could > > > reproduce it by reloading an existing ruleset dump. > > > > > > I'm running the script that I'm attaching manually, just one manual > > > invocation after another. > > > > Ouch, sorry for that. > > > > It looks like there's another case where inactive elements still affect > > overlap detection in an unexpected way... at least with the structure > > of this patch it should be easier to find, I'm looking into that now. > > ...what a mess. I could figure that part out (it was a case symmetric > to what this patch fixed, in this case resolving to case b5.) but > there's then another case (found by triggering a specific tree topology > with 0044interval_overlap_1) where we first add a start element, then > fail to add the end element because the start element is completely > "hidden" in the tree by inactive nodes. > > I tried to solve that with some backtracking, but that looks also > fragile. If I clean up the tree before insertion, instead, that will > only deal with expired nodes, not inactive nodes -- I can't erase > non-expired, inactive nodes because the API expects to find them at > some later point and call nft_rbtree_remove() on them. > > Now I'm trying another approach that looks more robust: instead of > descending the tree to find overlaps, just going through it in the same > way nft_rbtree_gc() does (linearly, node by node), marking the > value-wise closest points from left and right _valid_ nodes, and > applying the same reasoning. I need a bit more time for this, but it > looks way simpler. Insertion itself would keep working as it does now. > > In hindsight, it looks like it was a terrible idea to try to fix this > implementation. I really underestimated how bad this is. Functionally > speaking, it's not a red-black tree because: > > - we can't use it as a sorted binary tree, given that some elements > "don't matter" for some operations, or have another colour. We might > try to think of it as some other structure and rebuild from there > useful properties of sorted binary trees, but I'm not sure a > "red-brown-black" tree would have any other use making it worth of > any further research > > - end elements being represented as their value plus one also break > assumptions of sorted trees (this is the issue I'm actually facing > with backtracking) > > - left subtrees store keys greater than right subtrees, but this > looks consistent so it's just added fun and could be fixed > trivially (it's all reversed) > > By the way, I think we _should_ have similar issues in the lookup > function. Given that it's possible to build a tree with a subtree of at > least three levels with entirely non-valid nodes, I guess we can hide a > valid interval from the lookup too. It's just harder to test. Thanks for the detailed report. Another possibility? Maintain two trees, one for the existing generation (this is read-only) and another for the next generation (insertion/removals are performed on it), then swap them when commit happens? pipapo has similar requirements, currently it is relying on a workqueue to make some postprocess after the commit phase. At the expense of consuming more memory. > In the perspective of getting rid of it, I think we need: > > 1. some "introductory" documentation for nft_set_pipapo -- I just > got back to it (drawing some diagrams first...) > > 2. to understand if the performance gap in the few (maybe not > reasonable) cases where nft_set_rbtree matches faster than > nft_set_pipapo is acceptable. Summary: > https://lore.kernel.org/netfilter-devel/be7d4e51603633e7b88e4b0dde54b25a3e5a018b.1583598508.git.sbrivio@xxxxxxxxxx/ IIRC pipapo was not too far behind from rbtree for a few scenarios. > 3. a solution for https://bugzilla.netfilter.org/show_bug.cgi?id=1583, > it's an atomicity issue which has little to do with nft_set_pipapo > structures themselves, but I couldn't figure out the exact issue > yet. I'm struggling to find the time for it, so if somebody wants to > give it a try, I'd be more than happy to reassign it... OK, a different problem, related to pipapo. > Anyway, I'll post a different patch for nft_set_rbtree soon. Thanks.