Re: [PATCH nf-next v4 5/9] nf_tables: Add set type for arbitrary concatenation of ranges

On Fri, 7 Feb 2020 12:23:08 +0100
Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:

> Hi Stefano,
> 
> A bit of late feedback on your new datastructure.
> 
> Did you test with 8-bit grouping instead of 4-bit?

Yes, at the very beginning, though not with the final implementation. It
was somewhat faster (on x86_64, I don't remember by how much) for small
numbers of rules, but then I thought we would use too much memory,
because:

> Assuming a concatenation of 12 bytes (each field 4 bytes, hence 3
> fields):
> 
> * Using 4-bit groups: the number of buckets is 2^4 = 16, multiplied
>   by the bucket word size (assuming one long word, 8 bytes, covering
>   64 pipapo rules), that is 16 * 8 = 128 bytes per group-row in the
>   lookup table. Then, the number of group-rows is 8, given that each
>   field is 32 bits wide: 32 / 4 = 8 group-rows.
> 
>   8 * 128 bytes = 1024 bytes per lookup table.
> 
>   Assuming 3 fields, then this is 1024 * 3 = 3072 bytes.
> 
> * Using 8-bit groups: 2^8 = 256 buckets, then 256 * 8 = 2048 bytes per
>   group-row. Then, 32 / 8 = 4 group-rows in total.
> 
>   4 * 2048 bytes = 8192 bytes per lookup table.
> 
>   Therefore, 3 * 8192 = 24576 bytes. Still rather small.
> 
> This is missing, in the memory count, the mapping table that links the
> lookup tables. And I'm assuming that the number of pipapo rules in the
> lookup table fits into a 64-bit bucket long word.

...the (reasonable?) worst case I wanted to cover was two IPv6
addresses, one port, one MAC address (in ipset terms
"net,net,port,mac"), with 2 ^ 16 unique, non-overlapping entries each
(or ranges expanding to that number of rules), because that's what
(single, non-concatenated) ipset "bitmap" types can do.

Also ignoring the mapping table (it's "small"), with 4-bit groups:

- for the IPv6 addresses, we have 16 buckets, each 2 ^ 16
  bits wide, and 32 groups (128 bits / 4 bits), that is, 8MiB in
  total

- for the MAC address, 16 buckets, each 2 ^ 16 bits wide, and 12
  groups, 1.5MiB

- for the port, 16 buckets, each 2 ^ 16 bits wide, 4 groups, 0.5MiB

that is, 10MiB.

With 8-bit groups, we can just multiply everything by 8 (that is,
2 ^ 8 / 2 ^ 4 / 2, because we have 2 ^ (8 - 4) times as many buckets,
with half the groups): 80MiB.

And that started feeling like "a lot". However, I'm probably overdoing
it with the worst case -- this was just to explain what brought me to
the 4-bit choice, and now I'm starting to doubt it.
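
Both your calculation and mine follow the same formula, by the way; a
quick C sketch, just to illustrate it (function name made up here,
'rules' capped to one long word in your example, 2 ^ 16 in mine):

	/* Bits needed by the lookup table of one field: one row per
	 * group, one bucket per possible group value, each bucket one
	 * bit per pipapo rule.
	 */
	static unsigned long long lt_bits(unsigned int field_bits,
					  unsigned int group_bits,
					  unsigned long long rules)
	{
		unsigned int groups = field_bits / group_bits;
		unsigned long long buckets = 1ULL << group_bits;

		return groups * buckets * rules;
	}

	/* your example, 64 rules in one long word:
	 *	lt_bits(32, 4, 64) / 8 == 1024 bytes
	 *	lt_bits(32, 8, 64) / 8 == 8192 bytes
	 * my worst case, 2 ^ 16 rules, 4-bit groups:
	 *	2 * lt_bits(128, 4, 1ULL << 16)		8MiB (IPv6 pair)
	 *	    lt_bits( 48, 4, 1ULL << 16)		1.5MiB (MAC)
	 *	    lt_bits( 16, 4, 1ULL << 16)		0.5MiB (port)
	 */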

> Anyway, my understanding is that the more bits you use for grouping,
> the larger the lookup table becomes.
> 
> Still, both look very small in terms of memory consumption for these
> days.
> 
> I'm just mentioning this because the C implementation can probably get
> better numbers at the cost of consuming more memory? Probably worth
> doing this at some point?

Another aspect is the additional number of cachelines we would touch. I
don't expect that effect to be visible, but I might be wrong.

So yes, I think it's definitely worth a try, thanks for the hint! I'll
try to look into this soon and test it on a few archs (something with
small cachelines, say MIPS r2k, would be worth checking, too).

We could even consider dynamically adjusting the group size depending
on the set size; I don't know yet whether that gets too convoluted.
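
Something along these lines, as a purely hypothetical sketch (names and
threshold made up, lt_bits() from the sketch above):

	/* Pick 8-bit groups while the estimated lookup table stays
	 * small, fall back to 4-bit groups for larger sets.
	 */
	static unsigned int pipapo_group_bits(unsigned int field_bits,
					      unsigned long long rules)
	{
		/* arbitrary threshold: 1MiB, in bits */
		if (lt_bits(field_bits, 8, rules) <= 8ULL << 20)
			return 8;

		return 4;
	}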

> BTW, with a few more knobs it should be possible to better integrate
> this datastructure into the transaction infrastructure; this can be
> done incrementally.
> 
> More questions below.
> 
> On Wed, Jan 22, 2020 at 12:17:55AM +0100, Stefano Brivio wrote:
> [...]
> > diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c
> > new file mode 100644
> > index 000000000000..f0cb1e13af50
> > --- /dev/null
> > +++ b/net/netfilter/nft_set_pipapo.c  
> [...]
> > + *
> > + *       rule indices in current field: 0    1    2
> > + *       map to rules in next field:    0    1    1
> > + *
> > + *      the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be
> > + *      set.
> > + *
> > + *      We can now extend this example to cover the second iteration of the step
> > + *      above (lookup and AND bitmap): assuming the port field is
> > + *      2048 < 0  0  5  0 >, with starting result bitmap 0x2, and lookup table
> > + *      for "port" field from pre-computation example:
> > + *
> > + * ::
> > + *
> > + *                     bucket
> > + *      group  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
> > + *        0   0,1
> > + *        1   0,1
> > + *        2                    0   1
> > + *        3   0,1
> > + *
> > + *       operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5]
> > + *       & 0x3 [bucket 0], resulting bitmap is 0x2.
> > + *
> > + *   - if this is the last field in the set, look up the value from the mapping
> > + *     array corresponding to the final result bitmap
> > + *
> > + *      Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for
> > + *      last field from insertion example:
> > + *
> > + * ::
> > + *
> > + *       rule indices in last field:    0    1
> > + *       map to elements:             0x42  0x66  
> 
> Should this be instead?
> 
>             rule indices in last field:    0    1
>             map to elements:             0x66  0x42
> 
> If the resulting bitmap is 0x2, then this is actually pointing to
> rule index 1 in this lookup table, that is, the 2048 entry.

Right! Good catch, thanks.

I swapped the values in the "insertion" example above as well. For
some reason, this was correct in the equivalent example of the
stand-alone implementation:
	https://pipapo.lameexcu.se/pipapo/tree/pipapo.c#n162

> > + *      the matching element is at 0x42.  
> 
> Hence, the matching 0x42 element.
> 
> Otherwise, I don't understand how to interpret the "result bitmap". I
> thought this contains the matching pipapo rule index that is expressed
> as a bitmask.

Yes, that's correct. Let me know if you want me to send a patch or if
you'd rather fix it.
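
To make the interpretation explicit, a sketch with illustrative names
(not the ones from the patch): the result bitmap has one bit per pipapo
rule of the last field, and the matching element is taken at the index
of the set bit:

	unsigned long res = 0x2;	/* final result bitmap */
	int i = __builtin_ctzl(res);	/* index of set bit: 1 */

	/* elem = last_field_map[i], that is, 0x42 with the order
	 * fixed as you suggest
	 */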

> [...]
> > +static int pipapo_refill(unsigned long *map, int len, int rules,
> > +			 unsigned long *dst, union nft_pipapo_map_bucket *mt,
> > +			 bool match_only)
> > +{
> > +	unsigned long bitset;
> > +	int k, ret = -1;
> > +
> > +	for (k = 0; k < len; k++) {
> > +		bitset = map[k];
> > +		while (bitset) {
> > +			unsigned long t = bitset & -bitset;
> > +			int r = __builtin_ctzl(bitset);
> > +			int i = k * BITS_PER_LONG + r;
> > +
> > +			if (unlikely(i >= rules)) {
> > +				map[k] = 0;
> > +				return -1;
> > +			}
> > +
> > +			if (unlikely(match_only)) {  
> 
> Not a big issue, but this branch holds true for the last field, and my
> understanding of unlikely() is that it should be used for paths that
> are almost never taken. When reading the code I got confused, since
> this is actually likely to happen.

You're right, I wanted to make sure we avoid branching for the "common"
case (this early return happens just once), but this is probably an
abuse. I'll look into a more acceptable way to achieve this.
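
One option could be to split the match_only case out into its own
helper, so neither path needs a branch hint. An untested sketch (helper
name made up, and modulo the map cleanup the current code does on its
way out):

	static int pipapo_refill_match(unsigned long *map, int len,
				       int rules)
	{
		unsigned long i = find_first_bit(map, len * BITS_PER_LONG);

		/* find_first_bit() returns the bitmap size if no bit
		 * is set, so this also covers the empty case
		 */
		if (i >= (unsigned long)rules)
			return -1;

		bitmap_clear(map, i, 1);

		return i;
	}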

> > +				bitmap_clear(map, i, 1);
> > +				return i;
> > +			}
> > +
> > +			ret = 0;
> > +
> > +			bitmap_set(dst, mt[i].to, mt[i].n);
> > +
> > +			bitset ^= t;
> > +		}
> > +		map[k] = 0;
> > +	}
> > +
> > +	return ret;
> > +}

-- 
Stefano



