Hi Anton, On Wed, Sep 07, 2022 at 10:01:40AM +0200, Anton Protopopov wrote: > Add a new map, BPF_MAP_TYPE_WILDCARD, which provides means to implement generic > online packet classification. Here "online" stands for "fast lookups and fast > updates", and "generic" means that a user can create maps with custom lookup > schemes—different numbers of fields and different interpretation of individual > fields (prefix bitmasks, ranges, and direct matches). > > In particular, in Cilium we have several use cases for such a map: > > * XDP Prefilter is a simple XDP-based DDoS mitigation system provided by > Cilium. At the moment it only supports filtering by source CIDRs. It would > benefit from using this new map, as it allows to utilize wildcard rules > without big penalty comparing to one hash or LPM lookup utilized now. > > * XDP Packet Recorder (see https://lpc.events/event/11/contributions/953/) > > * K8S and Cilium Network Policies: as of k8s 1.25 port ranges are considered > to be a stable feature, and this map allows to implement this easily (and > also to provide more sophisticated filtering for Cilium Network Policies) > > Keys for wildcard maps are defined using the struct wildcard_key structure. > Besides the type field, it contains the data field of an arbitrary size. To > educate map about what's contained inside the data field, two additional > structures are used. The first one, struct wildcard_desc, also of arbitrary > size, tells how many fields are contained inside data, and the struct > wildcard_rule_desc structure defines how individual fields look like. > > Fields (rules) can be of three types: BPF_WILDCARD_RULE_{PREFIX,RANGE,MATCH}. > The PREFIX rule means that inside data we have a binary value and a binary > (prefix) mask: > > size u32 > <----------------> <-----------> > ... | rule value | prefix | ... > > Here rule value is a binary value, e.g., 123.324.128.0, and prefix is a u32 bit > variable; we only use lower 8 bits of it. We allow 8, 16, 32, 64, and 128 bit > values for PREFIX rules. > > The RANGE rule is determined by two binary values: minimum and maximum, treated > as unsigned integers of appropriate size: > > size size > <----------------> <----------------> > ... | min rule value | max rule value | ... > > We only allow the 8, 16, 32, and 64-bit for RANGE rules. > > The MATCH rule is determined by one binary value, and is basically the same as > (X,sizeof(X)*8) PREFIX rule, but can be processed a bit faster: > > size > <----------------> > ... | rule value | ... > > To speed up processing all the rules, including the prefix field, should be > stored in host byte order, and all elements in network byte order. 16-byte > fields are stored as {lo,hi}—lower eight bytes, then higher eight bytes. > > For elements only values are stored. > > To simplify definition of key structures, the BPF_WILDCARD_DESC_{1,2,3,4,5} > macros should be used. For example, one can define an IPv4 4-tuple keys as > follows: > > BPF_WILDCARD_DESC_4( > capture4_wcard, > BPF_WILDCARD_RULE_PREFIX, __u32, saddr, > BPF_WILDCARD_RULE_PREFIX, __u32, daddr, > BPF_WILDCARD_RULE_RANGE, __u16, sport, > BPF_WILDCARD_RULE_RANGE, __u16, dport > ); > > This macro will define the following structure: > > struct capture4_wcard_key { > __u32 type; > __u32 priority; > union { > struct { > __u32 saddr; > __u32 saddr_prefix; > __u32 daddr; > __u32 daddr_prefix; > __u16 sport_min; > __u16 sport_max; > __u16 dport_min; > __u16 dport_max; > } __packed rule; > struct { > __u32 saddr; > __u32 daddr; > __u16 sport; > __u16 dport; > } __packed; > }; > } __packed; > > Here type field should contain either BPF_WILDCARD_KEY_RULE or > BPF_WILDCARD_KEY_ELEM so that kernel can differentiate between rules and > elements. The rule structure is used to define (and lookup) rules, the unnamed > structure can be used to specify elements when matching them with rules. > > In order to simplify definition of a corresponding struct wildcard_desc, the > BPF_WILDCARD_DESC_* macros will create yet another structure: > > struct capture4_wcard_desc { > __uint(n_rules, 4); > struct { > __uint(type, BPF_WILDCARD_RULE_PREFIX); > __uint(size, sizeof(__u32)); > } saddr; > struct { > __uint(type, BPF_WILDCARD_RULE_PREFIX); > __uint(size, sizeof(__u32)); > } daddr; > struct { > __uint(type, BPF_WILDCARD_RULE_RANGE); > __uint(size, sizeof(__u16)); > } sport; > struct { > __uint(type, BPF_WILDCARD_RULE_RANGE); > __uint(size, sizeof(__u16)); > } dport; > }; > > This structure can be used in a (BTF) map definition as follows: > > __type(wildcard_desc, struct capture4_wcard_desc); > > Then libbpf will create a corresponding struct wildcard_desc and pass it to > kernel in bpf_attr using new map_extra_data/map_extra_data_size fields. > > The map implementation allows users to specify which algorithm to use to store > rules and lookup packets. Currently, three algorithms are supported: > > * Brute Force (suitable for map sizes of about 32 or below elements) > > * Tuple Merge (a variant of the Tuple Merge algorithm described in the > "TupleMerge: Fast Software Packet Processing for Online Packet > Classification" white paper, see https://nonsns.github.io/paper/rossi19ton.pdf. > The Tuple Merge algorithm is not protected by any patents.) > > * Static Tuple Merge (a variant of Tuple Merge where a set of lookup tables > is directly provided by a user) > > To select a specific algorithm one should set a flag in the map_extra field, > see examples below. > > Example 1. The following defines a brute force map to match IPv4 > source/destination CIDRs: > > BPF_WILDCARD_DESC_2( > capture4_l3, > BPF_WILDCARD_RULE_PREFIX, __u32, saddr, > BPF_WILDCARD_RULE_PREFIX, __u32, daddr, > ); > > struct { > __uint(type, BPF_MAP_TYPE_WILDCARD); > __type(key, struct capture4_l3_key); > __type(value, __u64); > __uint(max_entries, 16); > __uint(map_flags, BPF_F_NO_PREALLOC); > __uint(map_extra, BPF_WILDCARD_F_ALGORITHM_BF); > __type(wildcard_desc, struct capture4_l3_desc); > } filter_v4_bf __section(".maps"); > > Example 2. The only change requred for the previous example to use Tuple Merge > is to select a different algorithm: > > struct { > __uint(type, BPF_MAP_TYPE_WILDCARD); > __type(key, struct capture4_l3_key); > __type(value, __u64); > __uint(max_entries, 16); > __uint(map_flags, BPF_F_NO_PREALLOC); > __uint(map_extra, BPF_WILDCARD_F_ALGORITHM_TM); > __type(wildcard_desc, struct capture4_l3_desc); > } filter_v4_tm __section(".maps"); > > Example 3. Let's update the previous example to use Static Tuple Merge: > > BPF_WILDCARD_TM_OPTS( > capture4_l3, > BPF_WILDCARD_TM_TABLES_3(saddr, 16, 0, 16); > BPF_WILDCARD_TM_TABLES_3(daddr, 16, 16, 0); > ); > > struct { > __uint(type, BPF_MAP_TYPE_WILDCARD); > __type(key, struct capture4_l3_key); > __type(value, __u64); > __uint(max_entries, 100000); > __uint(map_flags, BPF_F_NO_PREALLOC); > __uint(map_extra, BPF_WILDCARD_F_ALGORITHM_TM | > BPF_WILDCARD_F_TM_STATIC_POOL | > BPF_WILDCARD_F_TM_POOL_LIST); > __type(wildcard_tm_opts, struct capture4_l3_opts); > } filter_v4_bf __section(".maps"); > > Here we set new flags to specify that a pool of tables should be used to select > new tables from (BPF_WILDCARD_F_TM_STATIC_POOL), and that the field > wildcard_tm_opts contains a list of tables to use (BPF_WILDCARD_F_TM_POOL_LIST, > an alternative is to use a Cartesian product of arrays provided). The > capture4_l3_opts is defined by a helper macro BPF_WILDCARD_TM_OPTS, where for > each field we define a list of prefixes to use. If a field is missing, then it > will be always ignored. > > The following changes and are not part of this RFC, but planned to be added before v1: > > * implement priorities, i.e., users will be able to specify rule priority as > u32 and rules with lower priorities will be matched first > > * implement !BPF_F_NO_PREALLOC: right now we always kalloc both elements and > new tables > > Signed-off-by: Anton Protopopov <aspsk@xxxxxxxxxxxxx> > --- > include/linux/bpf_types.h | 1 + > include/uapi/linux/bpf.h | 127 +++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/syscall.c | 51 +- > kernel/bpf/wildcard.c | 1526 +++++++++++++++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 129 ++- > tools/lib/bpf/bpf.c | 8 + > tools/lib/bpf/bpf.h | 5 +- > tools/lib/bpf/libbpf.c | 423 +++++++++ > tools/lib/bpf/libbpf_internal.h | 5 +- > 10 files changed, 2269 insertions(+), 8 deletions(-) > create mode 100644 kernel/bpf/wildcard.c > I'm interested in the goal of your patch but I'm having a grokking how the map is actually used. Is there any chance you could include a selftest or example? Thanks, Daniel [...]