On Fri, 7 Feb 2020 12:23:08 +0100 Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote: > Hi Stefano, > > A bit of late feedback on your new datastructure. > > Did you tests with 8-bits grouping instead of 4-bits? Yes, at the very beginning, not with the final implementation. It was somewhat faster (on x86_64, I don't remember how much) for small numbers of rules, then I thought we would use too much memory, because: > Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3 > fields): > > * Using 4-bits groups: the number of buckets is 2^4 = 16 multiplied > by the bucket word (assuming one long word, 8 bytes, 64 pipapo > rules) is 16 * 8 = 128 bytes per group-row in the looking table. Then, > the number of group-rows is 8 given that 32 bits, then 32 / 4 = 8 > group-rows. > > 8 * 128 bytes = 1024 bytes per lookup table. > > Assuming 3 fields, then this is 1024 * 3 = 3072 bytes. > > * Using 8-bits groups: 2^8 = 256, then 256 * 8 = 2048 bytes per > group-row. Then, 32 / 8 = 4 group-rows in total. > > 4 * 2048 bytes = 8192 bytes per lookup table. > > Therefore, 3 * 8192 = 24576 bytes. Still rather small. > > This is missing the mapping table that links the lookup tables in the > memory counting. And I'm assuming that the number of pipapo rules in > the lookup table fits into 64-bits bucket long word. ...the (reasonable?) worst case I wanted to cover was two IPv6 addresses, one port, one MAC address (in ipset terms "net,net,port,mac"), with 2 ^ 16 unique, non-overlapping entries each (or ranges expanding to that amount of rules), because that's what (single, non-concatenated) ipset "bitmap" types can do. Also ignoring the mapping table (it's "small"), with 4-bit buckets: - for the IPv6 addresses, we have 16 buckets, each 2 ^ 16 bits wide, and 32 groups (128 bits / 4 bits), that is, 8MiB in total - for the MAC address, 16 buckets, each 2 ^ 16 bits wide, and 12 groups, 1.5MiB - for the port, 16 buckets, each 2 ^ 12 bits wide, 2 groups, 0.25MiB that is, 9.75MiB. With 8-bit buckets: we can just multiply everything by 8 (that is, 2 ^ 8 / 2 ^ 4 / 2, because we have 2 ^ (8 - 4) times the buckets, with half the groups), 78MiB. And that started feeling like "a lot". However, I'm probably overdoing with the worst case -- this was just to explain what brought me to the 4-bit choice, now I start doubting about it. > Anyway, my understanding is that the more bits you use for grouping, > the larger the lookup table becomes. > > Still, both look very small in terms of memory consumption for these > days. > > I'm just telling this because the C implementation can probably get > better numbers at the cost of consuming more memory? Probably do this > at some point? Another topic is the additional amount of cachelines we would use. I don't expect that effect to be visible, but I might be wrong. So yes, I think it's definitely worth a try, thanks for the hint! I'll try to look into this soon and test it on a few archs (something with small cachelines, say MIPS r2k, would be worth checking, too). We could even consider to dynamically adjust group size depending on the set size, I don't know yet if that gets too convoluted. > BTW, with a few more knobs it should be possible to integrate better > this datastructure into the transaction infrastructure, this can be > done incrementally. > > More questions below. > > On Wed, Jan 22, 2020 at 12:17:55AM +0100, Stefano Brivio wrote: > [...] > > diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c > > new file mode 100644 > > index 000000000000..f0cb1e13af50 > > --- /dev/null > > +++ b/net/netfilter/nft_set_pipapo.c > [...] > > + * > > + * rule indices in current field: 0 1 2 > > + * map to rules in next field: 0 1 1 > > + * > > + * the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be > > + * set. > > + * > > + * We can now extend this example to cover the second iteration of the step > > + * above (lookup and AND bitmap): assuming the port field is > > + * 2048 < 0 0 5 0 >, with starting result bitmap 0x2, and lookup table > > + * for "port" field from pre-computation example: > > + * > > + * :: > > + * > > + * bucket > > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > > + * 0 0,1 > > + * 1 0,1 > > + * 2 0 1 > > + * 3 0,1 > > + * > > + * operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5] > > + * & 0x3 [bucket 0], resulting bitmap is 0x2. > > + * > > + * - if this is the last field in the set, look up the value from the mapping > > + * array corresponding to the final result bitmap > > + * > > + * Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for > > + * last field from insertion example: > > + * > > + * :: > > + * > > + * rule indices in last field: 0 1 > > + * map to elements: 0x42 0x66 > > Should this be instead? > > rule indices in last field: 0 1 > map to elements: 0x66 0x42 > > If the resulting bitmap is 0x2, then this is actually pointing to > rule index 1 in this lookup table, that is the 2048. Right! Good catch, thanks. I swapped the values also in the "insertion" example above. For some reason, this was correct in the equivalent example of the stand-alone implementation: https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n162 > > + * the matching element is at 0x42. > > Hence, the matching 0x42 element. > > Otherwise, I don't understand how to interpret the "result bitmap". I > thought this contains the matching pipapo rule index that is expressed > as a bitmask. Yes, that's correct. Let me know if you want me to send a patch or if you'd rather fix it. > [...] > > +static int pipapo_refill(unsigned long *map, int len, int rules, > > + unsigned long *dst, union nft_pipapo_map_bucket *mt, > > + bool match_only) > > +{ > > + unsigned long bitset; > > + int k, ret = -1; > > + > > + for (k = 0; k < len; k++) { > > + bitset = map[k]; > > + while (bitset) { > > + unsigned long t = bitset & -bitset; > > + int r = __builtin_ctzl(bitset); > > + int i = k * BITS_PER_LONG + r; > > + > > + if (unlikely(i >= rules)) { > > + map[k] = 0; > > + return -1; > > + } > > + > > + if (unlikely(match_only)) { > > Not a big issue, but this branch holds true for the last field, my > understanding for unlikely() is that it should be used for paths where > 99.99...% is not going to happen. Not a big issue, just that when > reading the code I got confused this is actually likely to happen. You're right, I wanted to make sure we avoid branching for the "common" case (this early return happens just once), but this is probably an abuse. I'll look into a more acceptable way to achieve this. > > + bitmap_clear(map, i, 1); > > + return i; > > + } > > + > > + ret = 0; > > + > > + bitmap_set(dst, mt[i].to, mt[i].n); > > + > > + bitset ^= t; > > + } > > + map[k] = 0; > > + } > > + > > + return ret; > > +}> -- Stefano