On Fri, 14 Feb 2020 21:42:25 +0100
Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:

> On Fri, Feb 14, 2020 at 08:42:13PM +0100, Stefano Brivio wrote:
> > On Fri, 14 Feb 2020 19:16:34 +0100
> > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> [...]
> > > You refer to a property that says that you can split a range into
> > > 2 * n netmasks IIRC. Do you know what is the worst case when
> > > splitting ranges?
> >
> > I'm not sure I got your question: that is exactly the worst case,
> > i.e. we can have _up to_ 2 * n netmasks (hence rules) given a range
> > of n bits. There's an additional upper bound on this, given by the
> > address space, but single fields in a concatenation can overlap.
> >
> > For example, we can have up to 128 rules for an IPv6 range where at
> > least 64 bits differ between the endpoints, and which would contain
> > 2 ^ 64 addresses. Or, say, the IPv4 range 1.2.3.4 - 255.255.0.2 is
> > expressed by 42 rules.
> >
> > By the way, 0.0.0.1 - 255.255.255.254 takes 62 rules, so we can
> > *probably* say it's 2 * n - 2, but I don't have a formal proof for
> > that.
>
> By "splitting" I was actually referring to "expanding", so you're
> replying here to my worst-case range-to-rules expansion question.
>
> > I have a couple of ways in mind to get that down to n / 2, but it's
> > not straightforward and it will take me some time (assuming it
> > makes sense). For the n bound, we can introduce negations (proof in
> > literature), and I have some kind of ugly prototype. For the n / 2
> > bound, I'd need some auxiliary data structure to keep insertion
> > invertible.
>
> OK, so there is room to improve the "rule expansion" logic. I didn't
> spend much time on that front yet.

I wrote some (barely understandable, if at all) notes here:

	https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n271

but I can't guarantee at this stage that it can all be implemented, or
not necessarily in that way. Should you ever (try to) read them, take
them as a rough, buggy draft.
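
To make the 2 * n worst case above more concrete: conceptually, the
current expansion is just the usual greedy split of a range into
maximal aligned blocks, one netmask (hence one rule) per block. A
rough userspace sketch, purely illustrative, not the code from
nft_set_pipapo.c:

/*
 * Illustration only: greedy split of an inclusive IPv4 range into CIDR
 * blocks, which is what the range-to-rules expansion boils down to.
 */
#include <stdint.h>
#include <stdio.h>

static int range_to_prefixes(uint32_t first, uint32_t last)
{
	uint64_t start = first, end = (uint64_t)last + 1;	/* half-open */
	int count = 0;

	while (start < end) {
		/* Largest block the alignment of 'start' allows... */
		uint64_t size = start ? (start & -start) : (1ULL << 32);

		/* ...shrunk until it fits in what's left of the range. */
		while (size > end - start)
			size >>= 1;

		printf("%u.%u.%u.%u/%d\n",
		       (uint32_t)start >> 24, ((uint32_t)start >> 16) & 0xff,
		       ((uint32_t)start >> 8) & 0xff, (uint32_t)start & 0xff,
		       32 - __builtin_ctzll(size));

		start += size;
		count++;
	}

	return count;
}

int main(void)
{
	/* The example above: 1.2.3.4 - 255.255.0.2 gives 42 blocks. */
	printf("%d\n", range_to_prefixes(0x01020304, 0xffff0002));
	return 0;
}

Feeding it 0.0.0.1 - 255.255.255.254 prints the 62 blocks mentioned
above.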

> > In practice, the "average" case is much less, but to define it we
> > would first need to agree on what are the actual components of the
> > multivariate distribution... size and start? Is it a Poisson
> > distribution then? After spending some time on this and disagreeing
> > with myself I'd shyly recommend to skip the topic. :)
>
> Yes, I agree, sticking to something relatively simple and good is
> just fine.
>
> > > There is no ipset set like this, but I agree the use case might
> > > happen.
> >
> > Actually, for ipset, a "net,port,net,port" type was proposed
> > (netfilter-devel <20181216213039.399-1-oliver@xxxxxxxxxxxxxx>), but
> > when József enquired about the intended use case, none was given.
> > So maybe this whole "net,net,port,mac" story makes even less sense.
>
> Would it make sense to you to restrict pipapo to 3 fields until there
> is someone with a use case for this?

I wouldn't. I'd rather implement the dynamic switch of bucket size if
this is the alternative. I think the current version is rather
generic, doesn't use too much memory, and is also reasonably tested,
so I don't feel like taking working functionality away.

> [...]
> > > The per-cpu scratch index is only required if we cannot fit the
> > > "result bitmap" into the stack, right?
> >
> > Right.
> >
> > > Probably up to 256 bytes result bitmap in the stack is
> > > reasonable? That makes 8192 pipapo rules. There will be no need
> > > to disable bh and make use of the percpu scratchpad area in that
> > > case.
> >
> > Right -- the question is whether that would mean yet another
> > implementation for the lookup function.
>
> This would need another lookup function that can be selected from the
> control plane path. The set size and the range-to-rule expansion
> worst case can tell us if it would fit into the stack. It would be
> just one extra lookup function for this case, ~80-100 LOC.

Ah, you're right, it's not much. On the other hand, the advantages
seem to be:

- dropping the local_bh_{disable,enable}() pair: I can't find numbers
  right now, but the difference in pps between v2 (when Florian
  pointed out it was needed) and v1 wasn't statistically relevant

- avoiding the access to per-cpu allocated areas: here I have
  before/after numbers just for net,mac with 1000 entries on the AVX2
  implementation (AMD 7351, one thread), and it didn't look
  significant, 3.99Mpps instead of 3.90Mpps. Maybe it would make a
  difference when tested on more threads (as a reference, same CPU,
  10 threads, as of v2: 38.02Mpps) or with a smaller L2 cache

- memory usage (and how that affects locality on some archs), but if
  we do this for cases with small numbers of entries, it's probably
  not a big difference either

Anyway, I guess it's worth a try -- I'll try to find some time for it.
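
If it helps to picture it, the per-field step would look roughly like
this toy, single-field model with byte-sized groups and the result
bitmap on the stack -- names and sizes are made up for illustration,
this is not the code from nft_set_pipapo.c:

/*
 * Toy model: one field, one bucket table per key byte, one bit per
 * rule in each bucket, result bitmap kept on the stack instead of in
 * a per-cpu scratch area.
 */
#include <stdint.h>
#include <string.h>

#define TOY_MAX_RULES	2048
#define TOY_BITMAP_LONGS (TOY_MAX_RULES / (8 * sizeof(unsigned long)))
#define TOY_GROUPS	4	/* e.g. the four bytes of an IPv4 address */

struct toy_field {
	/* One bucket per byte value per group, one bit per rule. */
	unsigned long buckets[TOY_GROUPS][256][TOY_BITMAP_LONGS];
};

/* Return the index of the first matching rule, or -1 if none matches. */
static int toy_lookup(const struct toy_field *f, const uint8_t *key)
{
	unsigned long res[TOY_BITMAP_LONGS];	/* on-stack result bitmap */
	unsigned int g, i;

	memset(res, 0xff, sizeof(res));

	/* AND the bucket selected by each key byte into the result. */
	for (g = 0; g < TOY_GROUPS; g++)
		for (i = 0; i < TOY_BITMAP_LONGS; i++)
			res[i] &= f->buckets[g][key[g]][i];

	for (i = 0; i < TOY_BITMAP_LONGS; i++)
		if (res[i])
			return i * 8 * sizeof(unsigned long) +
			       __builtin_ctzl(res[i]);

	return -1;
}

In the real thing, for a concatenation, the matching rules of one
field are then mapped to the rules of the next field, but the
per-field step above is the part that would benefit from keeping
res[] on the stack.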

> > > If adjusting the code to deal with variable length "pipapo word"
> > > size is not too convoluted, then you could just deal with the
> > > variable word size from the insert / delete / get (slow) path and
> > > register one lookup function for the version that is optimized
> > > for this pipapo word size.
> >
> > Yes, I like this a lot -- we would also need one function to
> > rebuild tables when the word size changes, but that sounds almost
> > trivial. Changes for the slow path are actually rather simple.
> >
> > Still, I start doubting quite heavily that my original worst case
> > is reasonable. If we stick to the one you mentioned, or even
> > something in between, it makes no sense to keep 4-bit buckets.
>
> OK, then moving to 8 bits will probably remove a bit of code which
> is dealing with "nibbles".

Right now I have:

 1 file changed, 16 insertions(+), 37 deletions(-)

but okay, I'll give the dynamic switch a try first.

> > By the way, I went ahead and tried the 8-bit bucket version of the
> > C implementation only, on my usual x86_64 box (one thread, AMD
> > Epyc 7351). I think it's worth it:
> >
> >                              4-bit         8-bit
> > net,port
> >   1000 entries            2304165pps    2901299pps
> > port,net
> >   100 entries             4131471pps    4751247pps
> > net6,port
> >   1000 entries            1092557pps    1651037pps
> > port,proto
> >   30000 entries            284147pps     449665pps
> > net6,port,mac
> >   10 entries              2082880pps    2762291pps
> > net6,port,mac,proto
> >   1000 entries             783810pps    1195823pps
> > net,mac
> >   1000 entries            1279122pps    1934003pps
>
> Assuming the same concatenation type, larger bucket size makes pps
> drop in the C implementation?

Uh, no, they go up with larger buckets: in the figures above we have
+26%, +15%, +51%, +58%, +33%, +53%, +51%.

> > I would now proceed extending this to the AVX2 implementation and
> > (once I finish it) to the NEON one, I actually expect bigger gains
> > there.
>
> Good. BTW, probably you can add a new NFT_SET_CLASS_JIT class that
> becomes NFT_SET_CLASS_O_1 to make the set routine that selects the
> set pick the jit version instead.

It's not exactly JIT though :) it's hand-written assembly. It's
coming up a bit nicer for NEON as I can use compiler intrinsics there.

Maybe I didn't understand your point, but I don't think this is
needed. In 9/9 of v4 I simply had, in nft_pipapo_avx2_estimate():

	if (!boot_cpu_has(X86_FEATURE_AVX2) ||
	    !boot_cpu_has(X86_FEATURE_AVX))
		return false;

> > > Probably adding a helper function to deal with pipapo words
> > > would help to prepare for such an update in the future. There is
> > > the ->estimate function that allows to calculate the best word
> > > size depending on all the information this gets from the set
> > > definition.
> >
> > Hm, I really think it should be kind of painless to make this
> > dynamic on insertion/deletion.
>
> OK, good. How would you like to proceed?

Let me give the dynamic bucket size switching a shot. If that looks
reasonable, I'd proceed extending that to the AVX2 implementation,
finish the NEON implementation including it right away, and prepare a
series. If it doesn't, I'll report back :)
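
For reference, the part the dynamic switch changes at lookup time is
just how the key bytes index the buckets: with 4-bit buckets each byte
needs two lookups (one per nibble), with 8-bit buckets a single one.
A toy illustration with made-up names, and with the result bitmap
shortened to a single word for brevity:

#include <stdint.h>

/* 4-bit buckets: 16 buckets per group, two groups per key byte. */
static unsigned long lookup_4bit(const unsigned long (*bkt)[16],
				 const uint8_t *key, unsigned int bytes)
{
	unsigned long res = ~0UL;
	unsigned int i;

	for (i = 0; i < bytes; i++) {
		res &= bkt[2 * i][key[i] >> 4];		/* high nibble */
		res &= bkt[2 * i + 1][key[i] & 0x0f];	/* low nibble */
	}

	return res;
}

/* 8-bit buckets: 256 buckets per group, one group per key byte. */
static unsigned long lookup_8bit(const unsigned long (*bkt)[256],
				 const uint8_t *key, unsigned int bytes)
{
	unsigned long res = ~0UL;
	unsigned int i;

	for (i = 0; i < bytes; i++)
		res &= bkt[i][key[i]];

	return res;
}

The trade-off is memory: 256 instead of 16 buckets per group, for half
the number of groups, so roughly eight times the bucket data per
field.

-- 
Stefano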