Hi Stefano, Just started reading, a few initial questions. On Fri, Nov 22, 2019 at 02:40:02PM +0100, Stefano Brivio wrote: [...] > diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h > index 7281895fa6d9..9759257ec8ec 100644 > --- a/include/net/netfilter/nf_tables_core.h > +++ b/include/net/netfilter/nf_tables_core.h > @@ -74,6 +74,7 @@ extern struct nft_set_type nft_set_hash_type; > extern struct nft_set_type nft_set_hash_fast_type; > extern struct nft_set_type nft_set_rbtree_type; > extern struct nft_set_type nft_set_bitmap_type; > +extern struct nft_set_type nft_set_pipapo_type; > > struct nft_expr; > struct nft_regs; > diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile > index 5e9b2eb24349..3f572e5a975e 100644 > --- a/net/netfilter/Makefile > +++ b/net/netfilter/Makefile > @@ -81,7 +81,8 @@ nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \ > nft_chain_route.o nf_tables_offload.o > > nf_tables_set-objs := nf_tables_set_core.o \ > - nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o > + nft_set_hash.o nft_set_bitmap.o nft_set_rbtree.o \ > + nft_set_pipapo.o > > obj-$(CONFIG_NF_TABLES) += nf_tables.o > obj-$(CONFIG_NF_TABLES_SET) += nf_tables_set.o > diff --git a/net/netfilter/nf_tables_set_core.c b/net/netfilter/nf_tables_set_core.c > index a9fce8d10051..586b621007eb 100644 > --- a/net/netfilter/nf_tables_set_core.c > +++ b/net/netfilter/nf_tables_set_core.c > @@ -9,12 +9,14 @@ static int __init nf_tables_set_module_init(void) > nft_register_set(&nft_set_rhash_type); > nft_register_set(&nft_set_bitmap_type); > nft_register_set(&nft_set_rbtree_type); > + nft_register_set(&nft_set_pipapo_type); > > return 0; > } > > static void __exit nf_tables_set_module_exit(void) > { > + nft_unregister_set(&nft_set_pipapo_type); > nft_unregister_set(&nft_set_rbtree_type); > nft_unregister_set(&nft_set_bitmap_type); > nft_unregister_set(&nft_set_rhash_type); > diff --git a/net/netfilter/nft_set_pipapo.c b/net/netfilter/nft_set_pipapo.c > new file mode 100644 > index 000000000000..3cad9aedc168 > --- /dev/null > +++ b/net/netfilter/nft_set_pipapo.c > @@ -0,0 +1,2197 @@ > +// SPDX-License-Identifier: GPL-2.0-only > + > +/* PIPAPO: PIle PAcket POlicies: set for arbitrary concatenations of ranges > + * > + * Copyright (c) 2019 Red Hat GmbH > + * > + * Author: Stefano Brivio <sbrivio@xxxxxxxxxx> > + */ > + > +/** > + * DOC: Theory of Operation > + * > + * > + * Problem > + * ------- > + * > + * Match packet bytes against entries composed of ranged or non-ranged packet > + * field specifiers, mapping them to arbitrary references. For example: > + * > + * :: > + * > + * --- fields ---> > + * | [net],[port],[net]... => [reference] > + * entries [net],[port],[net]... => [reference] > + * | [net],[port],[net]... => [reference] > + * V ... > + * > + * where [net] fields can be IP ranges or netmasks, and [port] fields are port > + * ranges. Arbitrary packet fields can be matched. > + * > + * > + * Algorithm Overview > + * ------------------ > + * > + * This algorithm is loosely inspired by [Ligatti 2010], and fundamentally > + * relies on the consideration that every contiguous range in a space of b bits > + * can be converted into b * 2 netmasks, from Theorem 3 in [Rottenstreich 2010], > + * as also illustrated in Section 9 of [Kogan 2014]. > + * > + * Classification against a number of entries, that require matching given bits > + * of a packet field, is performed by grouping those bits in sets of arbitrary > + * size, and classifying packet bits one group at a time. > + * > + * Example: > + * to match the source port (16 bits) of a packet, we can divide those 16 bits > + * in 4 groups of 4 bits each. Given the entry: > + * 0000 0001 0101 1001 > + * and a packet with source port: > + * 0000 0001 1010 1001 > + * first and second groups match, but the third doesn't. We conclude that the > + * packet doesn't match the given entry. > + * > + * Translate the set to a sequence of lookup tables, one per field. Each table > + * has two dimensions: bit groups to be matched for a single packet field, and > + * all the possible values of said groups (buckets). Input entries are > + * represented as one or more rules, depending on the number of composing > + * netmasks for the given field specifier, and a group match is indicated as a > + * set bit, with number corresponding to the rule index, in all the buckets > + * whose value matches the entry for a given group. > + * > + * Rules are mapped between fields through an array of x, n pairs, with each > + * item mapping a matched rule to one or more rules. The position of the pair in > + * the array indicates the matched rule to be mapped to the next field, x > + * indicates the first rule index in the next field, and n the amount of > + * next-field rules the current rule maps to. > + * > + * The mapping array for the last field maps to the desired references. > + * > + * To match, we perform table lookups using the values of grouped packet bits, > + * and use a sequence of bitwise operations to progressively evaluate rule > + * matching. > + * > + * A stand-alone, reference implementation, also including notes about possible > + * future optimisations, is available at: > + * https://pipapo.lameexcu.se/ > + * > + * Insertion > + * --------- > + * > + * - For each packet field: > + * > + * - divide the b packet bits we want to classify into groups of size t, > + * obtaining ceil(b / t) groups > + * > + * Example: match on destination IP address, with t = 4: 32 bits, 8 groups > + * of 4 bits each > + * > + * - allocate a lookup table with one column ("bucket") for each possible > + * value of a group, and with one row for each group > + * > + * Example: 8 groups, 2^4 buckets: > + * > + * :: > + * > + * bucket > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > + * 0 > + * 1 > + * 2 > + * 3 > + * 4 > + * 5 > + * 6 > + * 7 > + * > + * - map the bits we want to classify for the current field, for a given > + * entry, to a single rule for non-ranged and netmask set items, and to one > + * or multiple rules for ranges. Ranges are expanded to composing netmasks > + * by pipapo_expand(). > + * > + * Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048 > + * - rule #0: 10.0.0.5 > + * - rule #1: 192.168.1.0/24 > + * - rule #2: 192.168.2.0/31 > + * > + * - insert references to the rules in the lookup table, selecting buckets > + * according to bit values of a rule in the given group. This is done by > + * pipapo_insert(). > + * > + * Example: given: > + * - rule #0: 10.0.0.5 mapping to buckets > + * < 0 10 0 0 0 0 0 5 > > + * - rule #1: 192.168.1.0/24 mapping to buckets > + * < 12 0 10 8 0 1 < 0..15 > < 0..15 > > > + * - rule #2: 192.168.2.0/31 mapping to buckets > + * < 12 0 10 8 0 2 0 < 0..1 > > > + * > + * these bits are set in the lookup table: > + * > + * :: > + * > + * bucket > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > + * 0 0 1,2 > + * 1 1,2 0 > + * 2 0 1,2 > + * 3 0 1,2 > + * 4 0,1,2 > + * 5 0 1 2 > + * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > + * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 > + * > + * - if this is not the last field in the set, fill a mapping array that maps > + * rules from the lookup table to rules belonging to the same entry in > + * the next lookup table, done by pipapo_map(). > + * > + * Note that as rules map to contiguous ranges of rules, given how netmask > + * expansion and insertion is performed, &union nft_pipapo_map_bucket stores > + * this information as pairs of first rule index, rule count. > + * > + * Example: 2 entries, 10.0.0.5:1024 and 192.168.1.0-192.168.2.1:2048, > + * given lookup table #0 for field 0 (see example above): > + * > + * :: > + * > + * bucket > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > + * 0 0 1,2 > + * 1 1,2 0 > + * 2 0 1,2 > + * 3 0 1,2 > + * 4 0,1,2 > + * 5 0 1 2 > + * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > + * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 > + * > + * and lookup table #1 for field 1 with: > + * - rule #0: 1024 mapping to buckets > + * < 0 0 4 0 > > + * - rule #1: 2048 mapping to buckets > + * < 0 0 5 0 > > + * > + * :: > + * > + * bucket > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > + * 0 0,1 > + * 1 0,1 > + * 2 0 1 > + * 3 0,1 > + * > + * we need to map rules for 10.0.0.5 in lookup table #0 (rule #0) to 1024 > + * in lookup table #1 (rule #0) and rules for 192.168.1.0-192.168.2.1 > + * (rules #1, #2) to 2048 in lookup table #2 (rule #1): > + * > + * :: > + * > + * rule indices in current field: 0 1 2 > + * map to rules in next field: 0 1 1 > + * > + * - if this is the last field in the set, fill a mapping array that maps > + * rules from the last lookup table to element pointers, also done by > + * pipapo_map(). > + * > + * Note that, in this implementation, we have two elements (start, end) for > + * each entry. The pointer to the end element is stored in this array, and > + * the pointer to the start element is linked from it. > + * > + * Example: entry 10.0.0.5:1024 has a corresponding &struct nft_pipapo_elem > + * pointer, 0x66, and element for 192.168.1.0-192.168.2.1:2048 is at 0x42. > + * From the rules of lookup table #1 as mapped above: > + * > + * :: > + * > + * rule indices in last field: 0 1 > + * map to elements: 0x42 0x66 > + * > + * > + * Matching > + * -------- > + * > + * We use a result bitmap, with the size of a single lookup table bucket, to > + * represent the matching state that applies at every algorithm step. This is > + * done by pipapo_lookup(). > + * > + * - For each packet field: > + * > + * - start with an all-ones result bitmap (res_map in pipapo_lookup()) > + * > + * - perform a lookup into the table corresponding to the current field, > + * for each group, and at every group, AND the current result bitmap with > + * the value from the lookup table bucket > + * > + * :: > + * > + * Example: 192.168.1.5 < 12 0 10 8 0 1 0 5 >, with lookup table from > + * insertion examples. > + * Lookup table buckets are at least 3 bits wide, we'll assume 8 bits for > + * convenience in this example. Initial result bitmap is 0xff, the steps > + * below show the value of the result bitmap after each group is processed: > + * > + * bucket > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > + * 0 0 1,2 > + * result bitmap is now: 0xff & 0x6 [bucket 12] = 0x6 > + * > + * 1 1,2 0 > + * result bitmap is now: 0x6 & 0x6 [bucket 0] = 0x6 > + * > + * 2 0 1,2 > + * result bitmap is now: 0x6 & 0x6 [bucket 10] = 0x6 > + * > + * 3 0 1,2 > + * result bitmap is now: 0x6 & 0x6 [bucket 8] = 0x6 > + * > + * 4 0,1,2 > + * result bitmap is now: 0x6 & 0x7 [bucket 0] = 0x6 > + * > + * 5 0 1 2 > + * result bitmap is now: 0x6 & 0x2 [bucket 1] = 0x2 > + * > + * 6 0,1,2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > + * result bitmap is now: 0x2 & 0x7 [bucket 0] = 0x2 > + * > + * 7 1,2 1,2 1 1 1 0,1 1 1 1 1 1 1 1 1 1 1 > + * final result bitmap for this field is: 0x2 & 0x3 [bucket 5] = 0x2 > + * > + * - at the next field, start with a new, all-zeroes result bitmap. For each > + * bit set in the previous result bitmap, fill the new result bitmap > + * (fill_map in pipapo_lookup()) with the rule indices from the > + * corresponding buckets of the mapping field for this field, done by > + * pipapo_refill() > + * > + * Example: with mapping table from insertion examples, with the current > + * result bitmap from the previous example, 0x02: > + * > + * :: > + * > + * rule indices in current field: 0 1 2 > + * map to rules in next field: 0 1 1 > + * > + * the new result bitmap will be 0x02: rule 1 was set, and rule 1 will be > + * set. > + * > + * We can now extend this example to cover the second iteration of the step > + * above (lookup and AND bitmap): assuming the port field is > + * 2048 < 0 0 5 0 >, with starting result bitmap 0x2, and lookup table > + * for "port" field from pre-computation example: > + * > + * :: > + * > + * bucket > + * group 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > + * 0 0,1 > + * 1 0,1 > + * 2 0 1 > + * 3 0,1 > + * > + * operations are: 0x2 & 0x3 [bucket 0] & 0x3 [bucket 0] & 0x2 [bucket 5] > + * & 0x3 [bucket 0], resulting bitmap is 0x2. > + * > + * - if this is the last field in the set, look up the value from the mapping > + * array corresponding to the final result bitmap > + * > + * Example: 0x2 resulting bitmap from 192.168.1.5:2048, mapping array for > + * last field from insertion example: > + * > + * :: > + * > + * rule indices in last field: 0 1 > + * map to elements: 0x42 0x66 > + * > + * the matching element is at 0x42. > + * > + * > + * References > + * ---------- > + * > + * [Ligatti 2010] > + * A Packet-classification Algorithm for Arbitrary Bitmask Rules, with > + * Automatic Time-space Tradeoffs > + * Jay Ligatti, Josh Kuhn, and Chris Gage. > + * Proceedings of the IEEE International Conference on Computer > + * Communication Networks (ICCCN), August 2010. > + * http://www.cse.usf.edu/~ligatti/papers/grouper-conf.pdf > + * > + * [Rottenstreich 2010] > + * Worst-Case TCAM Rule Expansion > + * Ori Rottenstreich and Isaac Keslassy. > + * 2010 Proceedings IEEE INFOCOM, San Diego, CA, 2010. > + * http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.4592&rep=rep1&type=pdf > + * > + * [Kogan 2014] > + * SAX-PAC (Scalable And eXpressive PAcket Classification) > + * Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane, > + * and Patrick Eugster. > + * Proceedings of the 2014 ACM conference on SIGCOMM, August 2014. > + * http://www.sigcomm.org/sites/default/files/ccr/papers/2014/August/2619239-2626294.pdf > + */ > + > +#include <linux/kernel.h> > +#include <linux/init.h> > +#include <linux/log2.h> > +#include <linux/module.h> > +#include <linux/netlink.h> > +#include <linux/netfilter.h> > +#include <linux/netfilter/nf_tables.h> > +#include <net/netfilter/nf_tables_core.h> > +#include <uapi/linux/netfilter/nf_tables.h> > +#include <net/ipv6.h> /* For the maximum length of a field */ > +#include <linux/bitmap.h> > +#include <linux/bitops.h> > + > +/* Count of concatenated fields depends on count of 32-bit nftables registers */ > +#define NFT_PIPAPO_MAX_FIELDS NFT_REG32_COUNT > + > +/* Largest supported field size */ > +#define NFT_PIPAPO_MAX_BYTES (sizeof(struct in6_addr)) > +#define NFT_PIPAPO_MAX_BITS (NFT_PIPAPO_MAX_BYTES * BITS_PER_BYTE) > + > +/* Number of bits to be grouped together in lookup table buckets, arbitrary */ > +#define NFT_PIPAPO_GROUP_BITS 4 > +#define NFT_PIPAPO_GROUPS_PER_BYTE (BITS_PER_BYTE / NFT_PIPAPO_GROUP_BITS) > + > +/* Fields are padded to 32 bits in input registers */ > +#define NFT_PIPAPO_GROUPS_PADDED_SIZE(x) \ > + (round_up((x) / NFT_PIPAPO_GROUPS_PER_BYTE, sizeof(u32))) > +#define NFT_PIPAPO_GROUPS_PADDING(x) \ > + (NFT_PIPAPO_GROUPS_PADDED_SIZE((x)) - (x) / NFT_PIPAPO_GROUPS_PER_BYTE) > + > +/* Number of buckets, given by 2 ^ n, with n grouped bits */ > +#define NFT_PIPAPO_BUCKETS (1 << NFT_PIPAPO_GROUP_BITS) > + > +/* Each n-bit range maps to up to n * 2 rules */ > +#define NFT_PIPAPO_MAP_NBITS (const_ilog2(NFT_PIPAPO_MAX_BITS * 2)) > + > +/* Use the rest of mapping table buckets for rule indices, but it makes no sense > + * to exceed 32 bits > + */ > +#if BITS_PER_LONG == 64 > +#define NFT_PIPAPO_MAP_TOBITS 32 > +#else > +#define NFT_PIPAPO_MAP_TOBITS (BITS_PER_LONG - NFT_PIPAPO_MAP_NBITS) > +#endif > + > +/* ...which gives us the highest allowed index for a rule */ > +#define NFT_PIPAPO_RULE0_MAX ((1UL << (NFT_PIPAPO_MAP_TOBITS - 1)) \ > + - (1UL << NFT_PIPAPO_MAP_NBITS)) > + > +#define nft_pipapo_for_each_field(field, index, match) \ > + for ((field) = (match)->f, (index) = 0; \ > + (index) < (match)->field_count; \ > + (index)++, (field)++) > + > +/** > + * union nft_pipapo_map_bucket - Bucket of mapping table > + * @to: First rule number (in next field) this rule maps to > + * @n: Number of rules (in next field) this rule maps to > + * @e: If there's no next field, pointer to element this rule maps to > + */ > +union nft_pipapo_map_bucket { > + struct { > +#if BITS_PER_LONG == 64 > + static_assert(NFT_PIPAPO_MAP_TOBITS <= 32); > + u32 to; > + > + static_assert(NFT_PIPAPO_MAP_NBITS <= 32); > + u32 n; > +#else > + unsigned long to:NFT_PIPAPO_MAP_TOBITS; > + unsigned long n:NFT_PIPAPO_MAP_NBITS; > +#endif > + }; > + struct nft_pipapo_elem *e; > +}; > + > +/** > + * struct nft_pipapo_field - Lookup, mapping tables and related data for a field > + * @groups: Amount of 4-bit groups > + * @rules: Number of inserted rules > + * @bsize: Size of each bucket in lookup table, in longs > + * @lt: Lookup table: 'groups' rows of NFT_PIPAPO_BUCKETS buckets > + * @mt: Mapping table: one bucket per rule > + */ > +struct nft_pipapo_field { > + int groups; > + unsigned long rules; > + size_t bsize; > + unsigned long *lt; > + union nft_pipapo_map_bucket *mt; > +}; > + > +/** > + * struct nft_pipapo_match - Data used for lookup and matching > + * @field_count Amount of fields in set > + * @scratch: Preallocated per-CPU maps for partial matching results > + * @bsize_max: Maximum lookup table bucket size of all fields, in longs > + * @rcu Matching data is swapped on commits > + * @f: Fields, with lookup and mapping tables > + */ > +struct nft_pipapo_match { > + int field_count; > + unsigned long * __percpu *scratch; > + size_t bsize_max; > + struct rcu_head rcu; > + struct nft_pipapo_field f[0]; > +}; > + > +/* Current working bitmap index, toggled between field matches */ > +static DEFINE_PER_CPU(bool, nft_pipapo_scratch_index); > + > +/** > + * struct nft_pipapo - Representation of a set > + * @match: Currently in-use matching data > + * @clone: Copy where pending insertions and deletions are kept > + * @groups: Total amount of 4-bit groups for fields in this set > + * @width: Total bytes to be matched for one packet, including padding > + * @dirty: Working copy has pending insertions or deletions > + * @last_gc: Timestamp of last garbage collection run, jiffies > + * @start_data: Key data of start element for insertion > + * @start_elem: Start element for insertion > + */ > +struct nft_pipapo { > + struct nft_pipapo_match __rcu *match; > + struct nft_pipapo_match *clone; > + int groups; > + int width; > + bool dirty; > + unsigned long last_gc; > + u8 start_data[NFT_DATA_VALUE_MAXLEN * sizeof(u32)]; > + struct nft_pipapo_elem *start_elem; > +}; > + > +struct nft_pipapo_elem; > + > +/** > + * struct nft_pipapo_elem - API-facing representation of single set element > + * @start: Pointer to element that represents start of interval > + * @ext: nftables API extensions > + */ > +struct nft_pipapo_elem { > + struct nft_pipapo_elem *start; > + struct nft_set_ext ext; > +}; > + > +/** > + * pipapo_refill() - For each set bit, set bits from selected mapping table item > + * @map: Bitmap to be scanned for set bits > + * @len: Length of bitmap in longs > + * @rules: Number of rules in field > + * @dst: Destination bitmap > + * @mt: Mapping table containing bit set specifiers > + * @match_only: Find a single bit and return, don't fill > + * > + * Iteration over set bits with __builtin_ctzl(): Daniel Lemire, public domain. > + * > + * For each bit set in map, select the bucket from mapping table with index > + * corresponding to the position of the bit set. Use start bit and amount of > + * bits specified in bucket to fill region in dst. > + * > + * Return: -1 on no match, bit position on 'match_only', 0 otherwise. > + */ > +static int pipapo_refill(unsigned long *map, int len, int rules, > + unsigned long *dst, union nft_pipapo_map_bucket *mt, > + bool match_only) > +{ > + unsigned long bitset; > + int k, ret = -1; > + > + for (k = 0; k < len; k++) { > + bitset = map[k]; > + while (bitset) { > + unsigned long t = bitset & -bitset; > + int r = __builtin_ctzl(bitset); > + int i = k * BITS_PER_LONG + r; > + > + if (unlikely(i >= rules)) { > + map[k] = 0; > + return -1; > + } > + > + if (unlikely(match_only)) { > + bitmap_clear(map, i, 1); > + return i; > + } > + > + ret = 0; > + > + bitmap_set(dst, mt[i].to, mt[i].n); > + > + bitset ^= t; > + } > + map[k] = 0; > + } > + > + return ret; > +} > + > +/** > + * nft_pipapo_lookup() - Lookup function > + * @net: Network namespace > + * @set: nftables API set representation > + * @elem: nftables API element representation containing key data > + * @ext: nftables API extension pointer, filled with matching reference > + * > + * For more details, see DOC: Theory of Operation. > + * > + * Return: true on match, false otherwise. > + */ > +static bool nft_pipapo_lookup(const struct net *net, const struct nft_set *set, > + const u32 *key, const struct nft_set_ext **ext) > +{ > + struct nft_pipapo *priv = nft_set_priv(set); > + unsigned long *res_map, *fill_map; > + u8 genmask = nft_genmask_cur(net); > + const u8 *rp = (const u8 *)key; > + struct nft_pipapo_match *m; > + struct nft_pipapo_field *f; > + bool map_index; > + int i; > + > + local_bh_disable(); > + > + map_index = raw_cpu_read(nft_pipapo_scratch_index); > + > + m = rcu_dereference(priv->match); > + > + if (unlikely(!m || !*raw_cpu_ptr(m->scratch))) > + goto out; > + > + res_map = *raw_cpu_ptr(m->scratch) + (map_index ? m->bsize_max : 0); > + fill_map = *raw_cpu_ptr(m->scratch) + (map_index ? 0 : m->bsize_max); > + > + memset(res_map, 0xff, m->bsize_max * sizeof(*res_map)); > + > + nft_pipapo_for_each_field(f, i, m) { > + bool last = i == m->field_count - 1; > + unsigned long *lt = f->lt; > + int b, group; > + > + /* For each 4-bit group: select lookup table bucket depending on > + * packet bytes value, then AND bucket value > + */ > + for (group = 0; group < f->groups; group++) { > + u8 v; > + > + if (group % 2) { > + v = *rp & 0x0f; > + rp++; > + } else { > + v = *rp >> 4; > + } > + __bitmap_and(res_map, res_map, lt + v * f->bsize, > + f->bsize * BITS_PER_LONG); > + > + lt += f->bsize * NFT_PIPAPO_BUCKETS; > + } > + > + /* Now populate the bitmap for the next field, unless this is > + * the last field, in which case return the matched 'ext' > + * pointer if any. > + * > + * Now res_map contains the matching bitmap, and fill_map is the > + * bitmap for the next field. > + */ > +next_match: > + b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt, > + last); > + if (b < 0) { > + raw_cpu_write(nft_pipapo_scratch_index, map_index); > + local_bh_enable(); > + > + return false; > + } > + > + if (last) { > + *ext = &f->mt[b].e->ext; > + if (unlikely(nft_set_elem_expired(*ext) || > + !nft_set_elem_active(*ext, genmask))) > + goto next_match; > + > + /* Last field: we're just returning the key without > + * filling the initial bitmap for the next field, so the > + * current inactive bitmap is clean and can be reused as > + * *next* bitmap (not initial) for the next packet. > + */ > + raw_cpu_write(nft_pipapo_scratch_index, map_index); > + local_bh_enable(); > + > + return true; > + } > + > + /* Swap bitmap indices: res_map is the initial bitmap for the > + * next field, and fill_map is guaranteed to be all-zeroes at > + * this point. > + */ > + map_index = !map_index; > + swap(res_map, fill_map); > + > + rp += NFT_PIPAPO_GROUPS_PADDING(f->groups); > + } > + > +out: > + local_bh_enable(); > + return false; > +} > + > +/** > + * pipapo_get() - Get matching start or end element reference given key data > + * @net: Network namespace > + * @set: nftables API set representation > + * @data: Key data to be matched against existing elements > + * @flags: If NFT_SET_ELEM_INTERVAL_END is passed, return the end element > + * > + * This is essentially the same as the lookup function, except that it matches > + * key data against the uncommitted copy and doesn't use preallocated maps for > + * bitmap results. > + * > + * Return: pointer to &struct nft_pipapo_elem on match, error pointer otherwise. > + */ > +static void *pipapo_get(const struct net *net, const struct nft_set *set, > + const u8 *data, unsigned int flags) > +{ > + struct nft_pipapo *priv = nft_set_priv(set); > + struct nft_pipapo_match *m = priv->clone; > + unsigned long *res_map, *fill_map = NULL; > + void *ret = ERR_PTR(-ENOENT); > + struct nft_pipapo_field *f; > + int i; > + > + res_map = kmalloc_array(m->bsize_max, sizeof(*res_map), GFP_ATOMIC); > + if (!res_map) { > + ret = ERR_PTR(-ENOMEM); > + goto out; > + } > + > + fill_map = kcalloc(m->bsize_max, sizeof(*res_map), GFP_ATOMIC); > + if (!fill_map) { > + ret = ERR_PTR(-ENOMEM); > + goto out; > + } > + > + memset(res_map, 0xff, m->bsize_max * sizeof(*res_map)); > + > + nft_pipapo_for_each_field(f, i, m) { > + bool last = i == m->field_count - 1; > + unsigned long *lt = f->lt; > + int b, group; > + > + /* For each 4-bit group: select lookup table bucket depending on > + * packet bytes value, then AND bucket value > + */ > + for (group = 0; group < f->groups; group++) { > + u8 v; > + > + if (group % 2) { > + v = *data & 0x0f; > + data++; > + } else { > + v = *data >> 4; > + } > + __bitmap_and(res_map, res_map, lt + v * f->bsize, > + f->bsize * BITS_PER_LONG); > + > + lt += f->bsize * NFT_PIPAPO_BUCKETS; > + } > + > + /* Now populate the bitmap for the next field, unless this is > + * the last field, in which case return the matched 'ext' > + * pointer if any. > + * > + * Now res_map contains the matching bitmap, and fill_map is the > + * bitmap for the next field. > + */ > +next_match: > + b = pipapo_refill(res_map, f->bsize, f->rules, fill_map, f->mt, > + last); > + if (b < 0) > + goto out; > + > + if (last) { > + if (nft_set_elem_expired(&f->mt[b].e->ext)) > + goto next_match; > + > + if (flags & NFT_SET_ELEM_INTERVAL_END) > + ret = f->mt[b].e; > + else > + ret = f->mt[b].e->start; > + goto out; > + } > + > + data += NFT_PIPAPO_GROUPS_PADDING(f->groups); > + > + /* Swap bitmap indices: fill_map will be the initial bitmap for > + * the next field (i.e. the new res_map), and res_map is > + * guaranteed to be all-zeroes at this point, ready to be filled > + * according to the next mapping table. > + */ > + swap(res_map, fill_map); > + } > + > +out: > + kfree(fill_map); > + kfree(res_map); > + return ret; > +} > + > +/** > + * nft_pipapo_get() - Get matching element reference given key data > + * @net: Network namespace > + * @set: nftables API set representation > + * @elem: nftables API element representation containing key data > + * @flags: If NFT_SET_ELEM_INTERVAL_END is passed, return the end element > + */ > +static void *nft_pipapo_get(const struct net *net, const struct nft_set *set, > + const struct nft_set_elem *elem, unsigned int flags) > +{ > + return pipapo_get(net, set, (const u8 *)elem->key.val.data, flags); > +} > + > +/** > + * pipapo_resize() - Resize lookup or mapping table, or both > + * @f: Field containing lookup and mapping tables > + * @old_rules: Previous amount of rules in field > + * @rules: New amount of rules > + * > + * Increase, decrease or maintain tables size depending on new amount of rules, > + * and copy data over. In case the new size is smaller, throw away data for > + * highest-numbered rules. > + * > + * Return: 0 on success, -ENOMEM on allocation failure. > + */ > +static int pipapo_resize(struct nft_pipapo_field *f, int old_rules, int rules) > +{ > + long *new_lt = NULL, *new_p, *old_lt = f->lt, *old_p; > + union nft_pipapo_map_bucket *new_mt, *old_mt = f->mt; > + size_t new_bucket_size, copy; > + int group, bucket; > + > + new_bucket_size = DIV_ROUND_UP(rules, BITS_PER_LONG); > + > + if (new_bucket_size == f->bsize) > + goto mt; > + > + if (new_bucket_size > f->bsize) > + copy = f->bsize; > + else > + copy = new_bucket_size; > + > + new_lt = kvzalloc(f->groups * NFT_PIPAPO_BUCKETS * new_bucket_size * > + sizeof(*new_lt), GFP_KERNEL); > + if (!new_lt) > + return -ENOMEM; > + > + new_p = new_lt; > + old_p = old_lt; > + for (group = 0; group < f->groups; group++) { > + for (bucket = 0; bucket < NFT_PIPAPO_BUCKETS; bucket++) { > + memcpy(new_p, old_p, copy * sizeof(*new_p)); > + new_p += copy; > + old_p += copy; > + > + if (new_bucket_size > f->bsize) > + new_p += new_bucket_size - f->bsize; > + else > + old_p += f->bsize - new_bucket_size; > + } > + } > + > +mt: > + new_mt = kvmalloc(rules * sizeof(*new_mt), GFP_KERNEL); > + if (!new_mt) { > + kvfree(new_lt); > + return -ENOMEM; > + } > + > + memcpy(new_mt, f->mt, min(old_rules, rules) * sizeof(*new_mt)); > + if (rules > old_rules) { > + memset(new_mt + old_rules, 0, > + (rules - old_rules) * sizeof(*new_mt)); > + } > + > + if (new_lt) { > + f->bsize = new_bucket_size; > + f->lt = new_lt; > + kvfree(old_lt); > + } > + > + f->mt = new_mt; > + kvfree(old_mt); > + > + return 0; > +} > + > +/** > + * pipapo_bucket_set() - Set rule bit in bucket given group and group value > + * @f: Field containing lookup table > + * @rule: Rule index > + * @group: Group index > + * @v: Value of bit group > + */ > +static void pipapo_bucket_set(struct nft_pipapo_field *f, int rule, int group, > + int v) > +{ > + unsigned long *pos; > + > + pos = f->lt + f->bsize * NFT_PIPAPO_BUCKETS * group; > + pos += f->bsize * v; > + > + __set_bit(rule, pos); > +} > + > +/** > + * pipapo_insert() - Insert new rule in field given input key and mask length > + * @f: Field containing lookup table > + * @k: Input key for classification, without nftables padding > + * @mask_bits: Length of mask; matches field length for non-ranged entry > + * > + * Insert a new rule reference in lookup buckets corresponding to k and > + * mask_bits. > + * > + * Return: 1 on success (one rule inserted), negative error code on failure. > + */ > +static int pipapo_insert(struct nft_pipapo_field *f, const uint8_t *k, > + int mask_bits) > +{ > + int rule = f->rules++, group, ret; > + > + ret = pipapo_resize(f, f->rules - 1, f->rules); > + if (ret) > + return ret; > + > + for (group = 0; group < f->groups; group++) { > + int i, v; > + u8 mask; > + > + if (group % 2) > + v = k[group / 2] & 0x0f; > + else > + v = k[group / 2] >> 4; > + > + if (mask_bits >= (group + 1) * 4) { > + /* Not masked */ > + pipapo_bucket_set(f, rule, group, v); > + } else if (mask_bits <= group * 4) { > + /* Completely masked */ > + for (i = 0; i < NFT_PIPAPO_BUCKETS; i++) > + pipapo_bucket_set(f, rule, group, i); > + } else { > + /* The mask limit falls on this group */ > + mask = 0x0f >> (mask_bits - group * 4); > + for (i = 0; i < NFT_PIPAPO_BUCKETS; i++) { > + if ((i & ~mask) == (v & ~mask)) > + pipapo_bucket_set(f, rule, group, i); > + } > + } > + } > + > + return 1; > +} > + > +/** > + * pipapo_step_diff() - Check if setting @step bit in netmask would change it > + * @base: Mask we are expanding > + * @step: Step bit for given expansion step > + * @len: Total length of mask space (set and unset bits), bytes > + * > + * Convenience function for mask expansion. > + * > + * Return: true if step bit changes mask (i.e. isn't set), false otherwise. > + */ > +static bool pipapo_step_diff(u8 *base, int step, int len) > +{ > + /* Network order, byte-addressed */ > +#ifdef __BIG_ENDIAN__ > + return !(BIT(step % BITS_PER_BYTE) & base[step / BITS_PER_BYTE]); > +#else > + return !(BIT(step % BITS_PER_BYTE) & > + base[len - 1 - step / BITS_PER_BYTE]); > +#endif > +} > + > +/** > + * pipapo_step_after_end() - Check if mask exceeds range end with given step > + * @base: Mask we are expanding > + * @end: End of range > + * @step: Step bit for given expansion step, highest bit to be set > + * @len: Total length of mask space (set and unset bits), bytes > + * > + * Convenience function for mask expansion. > + * > + * Return: true if mask exceeds range setting step bits, false otherwise. > + */ > +static bool pipapo_step_after_end(const u8 *base, const u8 *end, int step, > + int len) > +{ > + u8 tmp[NFT_PIPAPO_MAX_BYTES]; > + int i; > + > + memcpy(tmp, base, len); > + > + /* Network order, byte-addressed */ > + for (i = 0; i <= step; i++) > +#ifdef __BIG_ENDIAN__ > + tmp[i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE); > +#else > + tmp[len - 1 - i / BITS_PER_BYTE] |= BIT(i % BITS_PER_BYTE); > +#endif > + > + return memcmp(tmp, end, len) > 0; > +} > + > +/** > + * pipapo_base_sum() - Sum step bit to given len-sized netmask base with carry > + * @base: Netmask base > + * @step: Step bit to sum > + * @len: Netmask length, bytes > + */ > +static void pipapo_base_sum(u8 *base, int step, int len) > +{ > + bool carry = false; > + int i; > + > + /* Network order, byte-addressed */ > +#ifdef __BIG_ENDIAN__ > + for (i = step / BITS_PER_BYTE; i < len; i++) { > +#else > + for (i = len - 1 - step / BITS_PER_BYTE; i >= 0; i--) { > +#endif > + if (carry) > + base[i]++; > + else > + base[i] += 1 << (step % BITS_PER_BYTE); > + > + if (base[i]) > + break; > + > + carry = true; > + } > +} > + > +/** > + * expand() - Expand range to composing netmasks and insert into lookup table > + * @f: Field containing lookup table > + * @start: Start of range > + * @end: End of range > + * @len: Length of value in bits > + * > + * Expand range to composing netmasks and insert corresponding rule references > + * in lookup buckets. > + * > + * Return: number of inserted rules on success, negative error code on failure. > + */ > +static int pipapo_expand(struct nft_pipapo_field *f, > + const u8 *start, const u8 *end, int len) > +{ > + int step, masks = 0, bytes = DIV_ROUND_UP(len, BITS_PER_BYTE); > + u8 base[NFT_PIPAPO_MAX_BYTES]; > + > + memcpy(base, start, bytes); > + while (memcmp(base, end, bytes) <= 0) { > + int err; > + > + step = 0; > + while (pipapo_step_diff(base, step, bytes)) { > + if (pipapo_step_after_end(base, end, step, bytes)) > + break; > + > + step++; > + if (step >= len) { > + if (!masks) { > + pipapo_insert(f, base, 0); > + masks = 1; > + } > + goto out; > + } > + } > + > + err = pipapo_insert(f, base, len - step); > + > + if (err < 0) > + return err; > + > + masks++; > + pipapo_base_sum(base, step, bytes); > + } > +out: > + return masks; > +} > + > +/** > + * pipapo_map() - Insert rules in mapping tables, mapping them between fields > + * @m: Matching data, including mapping table > + * @map: Table of rule maps: array of first rule and amount of rules > + * in next field a given rule maps to, for each field > + * @ext: For last field, nft_set_ext pointer matching rules map to > + */ > +static void pipapo_map(struct nft_pipapo_match *m, > + union nft_pipapo_map_bucket map[NFT_PIPAPO_MAX_FIELDS], > + struct nft_pipapo_elem *e) > +{ > + struct nft_pipapo_field *f; > + int i, j; > + > + for (i = 0, f = m->f; i < m->field_count - 1; i++, f++) { > + for (j = 0; j < map[i].n; j++) { > + f->mt[map[i].to + j].to = map[i + 1].to; > + f->mt[map[i].to + j].n = map[i + 1].n; > + } > + } > + > + /* Last field: map to ext instead of mapping to next field */ > + for (j = 0; j < map[i].n; j++) > + f->mt[map[i].to + j].e = e; > +} > + > +/** > + * pipapo_realloc_scratch() - Reallocate scratch maps for partial match results > + * @clone: Copy of matching data with pending insertions and deletions > + * @bsize_max Maximum bucket size, scratch maps cover two buckets > + * > + * Return: 0 on success, -ENOMEM on failure. > + */ > +static int pipapo_realloc_scratch(struct nft_pipapo_match *clone, > + unsigned long bsize_max) > +{ > + int i; > + > + for_each_possible_cpu(i) { > + unsigned long *scratch; > + > + scratch = kzalloc_node(bsize_max * sizeof(*scratch) * 2, > + GFP_KERNEL, cpu_to_node(i)); > + if (!scratch) { > + /* On failure, there's no need to undo previous > + * allocations: this means that some scratch maps have > + * a bigger allocated size now (this is only called on > + * insertion), but the extra space won't be used by any > + * CPU as new elements are not inserted and m->bsize_max > + * is not updated. > + */ > + return -ENOMEM; > + } > + > + kfree(*per_cpu_ptr(clone->scratch, i)); > + > + *per_cpu_ptr(clone->scratch, i) = scratch; > + } > + > + return 0; > +} > + > +/** > + * nft_pipapo_insert() - Validate and insert ranged elements > + * @net: Network namespace > + * @set: nftables API set representation > + * @elem: nftables API element representation containing key data > + * @flags: If NFT_SET_ELEM_INTERVAL_END is passed, this is the end element > + * @ext2: Filled with pointer to &struct nft_set_ext in inserted element > + * > + * In this set implementation, this functions needs to be called twice, with > + * start and end element, to obtain a valid entry insertion. > + * > + * Calls to this function are serialised with each other, so we can store > + * element and key data on the first call with start element, and use it on the > + * second call once we get the end element too. > + * > + * However, userspace could send a single NFT_SET_ELEM_INTERVAL_END element, > + * without a start element, so we need to check for it explicitly before > + * inserting an entry, lest we end up in nft_pipapo_walk() with an empty start > + * element. > + * > + * Also, we need to make sure that the start element hasn't been deactivated or > + * destroyed between the two calls to this function, otherwise we might link an > + * invalid start item to the end item triggering the insertion. Clear > + * priv->start_elem on any operation that might render it invalid. > + * > + * Return: 0 on success, error pointer on failure. > + */ > +static int nft_pipapo_insert(const struct net *net, const struct nft_set *set, > + const struct nft_set_elem *elem, > + struct nft_set_ext **ext2) > +{ > + const struct nft_set_ext *ext = nft_set_elem_ext(set, elem->priv); > + const u8 *data = (const u8 *)elem->key.val.data, *start, *end; > + union nft_pipapo_map_bucket rulemap[NFT_PIPAPO_MAX_FIELDS]; > + struct nft_pipapo *priv = nft_set_priv(set); > + struct nft_pipapo_match *m = priv->clone; > + struct nft_pipapo_elem *e = elem->priv; > + struct nft_pipapo_field *f; > + int i, bsize_max, err = 0; > + void *dup; > + > + dup = nft_pipapo_get(net, set, elem, 0); > + if (PTR_ERR(dup) != -ENOENT) { > + priv->start_elem = NULL; > + if (IS_ERR(dup)) > + return PTR_ERR(dup); > + *ext2 = dup; dup should be of nft_set_ext type. I just had a look at nft_pipapo_get() and I think this returns nft_pipapo_elem, which is almost good, since it contains nft_set_ext, right? I think you also need to check if the object is active in the next generation via nft_genmask_next() and nft_set_elem_active(), otherwise ignore it. Note that the datastructure needs to temporarily deal with duplicates, ie. one inactive object (just deleted) and one active object (just added) for the next generation. > + return -EEXIST; > + } > + > + if (!nft_set_ext_exists(ext, NFT_SET_EXT_FLAGS) || > + !(*nft_set_ext_flags(ext) & NFT_SET_ELEM_INTERVAL_END)) { > + priv->start_elem = e; > + *ext2 = &e->ext; > + memcpy(priv->start_data, data, priv->width); > + return 0; > + } > + > + if (!priv->start_elem) > + return -EINVAL; I'm working on a sketch patch to extend the front-end code to make this easier for you, will post it asap, so you don't need this special handling to collect both ends of the interval. So far, just spend a bit of time on this, will get back to you with more feedback. Thanks for working on this!