Re: [RFC PATCH] bpf: introduce new bpf map type BPF_MAP_TYPE_WILDCARD

Anton Protopopov <aspsk@xxxxxxxxxxxxx> · Thu, 8 Sep 2022 10:48:20 +0200

On 22/09/07 04:09, sdf@xxxxxxxxxx wrote:
> [sorry for the noise, got a bounce from the list, resend with the
> message trimmed]
> 
> On 09/07, Anton Protopopov wrote:
> > Add a new map, BPF_MAP_TYPE_WILDCARD, which provides means to implement
> > generic
> > online packet classification. Here "online" stands for "fast lookups and
> > fast
> > updates", and "generic" means that a user can create maps with custom
> > lookup
> > schemes—different numbers of fields and different interpretation of
> > individual
> > fields (prefix bitmasks, ranges, and direct matches).
> 
> > In particular, in Cilium we have several use cases for such a map:
> 
> >    * XDP Prefilter is a simple XDP-based DDoS mitigation system provided by
> >      Cilium. At the moment it only supports filtering by source CIDRs.
> > It would
> >      benefit from using this new map, as it allows to utilize wildcard
> > rules
> >      without big penalty comparing to one hash or LPM lookup utilized now.
> 
> >    * XDP Packet Recorder (see
> > https://lpc.events/event/11/contributions/953/)
> 
> >    * K8S and Cilium Network Policies: as of k8s 1.25 port ranges are
> > considered
> >      to be a stable feature, and this map allows to implement this
> > easily (and
> >      also to provide more sophisticated filtering for Cilium Network
> > Policies)
> 
> > Keys for wildcard maps are defined using the struct wildcard_key
> > structure.
> > Besides the type field, it contains the data field of an arbitrary size.
> > To
> > educate map about what's contained inside the data field, two additional
> > structures are used. The first one, struct wildcard_desc, also of
> > arbitrary
> > size, tells how many fields are contained inside data, and the struct
> > wildcard_rule_desc structure defines how individual fields look like.
> 
> > Fields (rules) can be of three types:
> > BPF_WILDCARD_RULE_{PREFIX,RANGE,MATCH}.
> 
> [..]
> 
> > The PREFIX rule means that inside data we have a binary value and a binary
> > (prefix) mask:
> 
> >                 size             u32
> >          <----------------> <----------->
> >     ... |    rule value    |   prefix   | ...
> 
> > Here rule value is a binary value, e.g., 123.324.128.0, and prefix is a
> > u32 bit
> > variable; we only use lower 8 bits of it. We allow 8, 16, 32, 64, and
> > 128 bit
> > values for PREFIX rules.
> 
> I haven't looked at the code, so pardon stupid questions. This sounds
> like optimized LPM-trie?
> 
> If not, what's the difference?

First, this map provides more generic API than LPM. Namely, we allow not one,
but multiple prefixes (say, to specify both source and destination prefix), and
also not only prefixes, but ranges. See the example I've just posted in reply
to Daniel, but in short, we can specify rules like

  (192.68.0.0/16, 10.0.0.0/24, *, 22)

We are also not limited by 4-tuples, we can create any combination of rules.

Second, the [tuple merge implementation of this] map uses hash tables to do
lookups, not tries.

I also should have mentioned that I have a talk on Plumbers next Tuesday about
this map, I will talk about API and algorithms there, and provide some numbers.
See https://lpc.events/event/16/contributions/1356/

> If yes, can this be special cased/optimized in the existing LPM-trie
> optimization? I think we've tried in the past, mostly gave up and
> "fixed" by caching the state in the socket local storage.
> 
> Also, fixed 8/16/32/64 prefixes seems like a big limitation? At least if
> I were to store ipv4 from the classless (cidr) world..

This is not for prefixes, but for the field lengths. The basic use case for
this map is to filter packets, so we can create 4- or 5-tuple IPv4 or IPv6
maps. In the first case our field lengths will be (32, 32, 16, 16), in the
second - (128, 128, 16, 16). Prefixes can be of any length from 0 up to the
field size.

> > The RANGE rule is determined by two binary values: minimum and maximum,
> > treated
> > as unsigned integers of appropriate size:
> 
> >                 size               size
> >          <----------------> <---------------->
> >     ... |  min rule value  |  max rule value  | ...
> 
> > We only allow the 8, 16, 32, and 64-bit for RANGE rules.
> 
> That seems useful. I was thinking about similar 'rangemap' where
> we can effectively store and lookup [a,b] ranges. Might be useful
> in firewalls for storing lists of ports efficiently. So why not
> a separate map instead? Are you able to share a lot of code with
> the general matching map?

Yes, all is included inside one map. For firewalls users can create individual
maps with different specifications. Say, one can filter 4-tuples, or 5-tuples,
or (src, dst, dst port), or (flow-label, dest port), etc.

> > The MATCH rule is determined by one binary value, and is basically the
> > same as
> > (X,sizeof(X)*8) PREFIX rule, but can be processed a bit faster:
> 
> >                 size
> >          <---------------->
> >     ... |    rule value    | ...
> 
> > To speed up processing all the rules, including the prefix field, should
> > be
> > stored in host byte order, and all elements in network byte order. 16-byte
> > fields are stored as {lo,hi}—lower eight bytes, then higher eight bytes.
> 
> > For elements only values are stored.
> 
> Can these optimization be auto-magically derived whenever PREFIX map
> with the right values is created? Why put this decision on the users?

Thanks for pointing this out. I am not really proud of this particular
interface...

For packet filtering values tend to appear in network byte order, so we can
just assume this. We definitely can optimize values to the right order for
rules internally when bpf_map_update_elem is called.

We also can add a map flag to process values in host byte order. This can be
helpful for a non-networking usage, when values appear in host byte order
naturally.

> > To simplify definition of key structures, the
> > BPF_WILDCARD_DESC_{1,2,3,4,5}
> > macros should be used. For example, one can define an IPv4 4-tuple keys as
> > follows:
> 
> >     BPF_WILDCARD_DESC_4(
> >          capture4_wcard,
> >          BPF_WILDCARD_RULE_PREFIX, __u32, saddr,
> >          BPF_WILDCARD_RULE_PREFIX, __u32, daddr,
> >          BPF_WILDCARD_RULE_RANGE, __u16, sport,
> >          BPF_WILDCARD_RULE_RANGE, __u16, dport
> >     );
> 
> > This macro will define the following structure:
> 
> >     struct capture4_wcard_key {
> >          __u32 type;
> >          __u32 priority;
> >          union {
> >              struct {
> >                      __u32 saddr;
> >                      __u32 saddr_prefix;
> >                      __u32 daddr;
> >                      __u32 daddr_prefix;
> >                      __u16 sport_min;
> >                      __u16 sport_max;
> >                      __u16 dport_min;
> >                      __u16 dport_max;
> >              } __packed rule;
> >              struct {
> >                      __u32 saddr;
> >                      __u32 daddr;
> >                      __u16 sport;
> >                      __u16 dport;
> >              } __packed;
> >          };
> >     } __packed;
> 
> > Here type field should contain either BPF_WILDCARD_KEY_RULE or
> > BPF_WILDCARD_KEY_ELEM so that kernel can differentiate between rules and
> > elements. The rule structure is used to define (and lookup) rules, the
> > unnamed
> > structure can be used to specify elements when matching them with rules.
> 
> > In order to simplify definition of a corresponding struct wildcard_desc,
> > the
> > BPF_WILDCARD_DESC_* macros will create yet another structure:
> 
> >     struct capture4_wcard_desc {
> >          __uint(n_rules, 4);
> >          struct {
> >                  __uint(type, BPF_WILDCARD_RULE_PREFIX);
> >                  __uint(size, sizeof(__u32));
> >          } saddr;
> >          struct {
> >                  __uint(type, BPF_WILDCARD_RULE_PREFIX);
> >                  __uint(size, sizeof(__u32));
> >          } daddr;
> >          struct {
> >                  __uint(type, BPF_WILDCARD_RULE_RANGE);
> >                  __uint(size, sizeof(__u16));
> >          } sport;
> >          struct {
> >                  __uint(type, BPF_WILDCARD_RULE_RANGE);
> >                  __uint(size, sizeof(__u16));
> >          } dport;
> >     };
> 
> > This structure can be used in a (BTF) map definition as follows:
> 
> >      __type(wildcard_desc, struct capture4_wcard_desc);
> 
> > Then libbpf will create a corresponding struct wildcard_desc and pass it
> > to
> > kernel in bpf_attr using new map_extra_data/map_extra_data_size fields.
> 
> [..]
> 
> > The map implementation allows users to specify which algorithm to use to
> > store
> > rules and lookup packets. Currently, three algorithms are supported:
> 
> >    * Brute Force (suitable for map sizes of about 32 or below elements)
> 
> >    * Tuple Merge (a variant of the Tuple Merge algorithm described in the
> >      "TupleMerge: Fast Software Packet Processing for Online Packet
> >      Classification" white paper, see
> > https://nonsns.github.io/paper/rossi19ton.pdf.
> >      The Tuple Merge algorithm is not protected by any patents.)
> 
> >    * Static Tuple Merge (a variant of Tuple Merge where a set of lookup
> > tables
> >      is directly provided by a user)
> 
> As a user that has no clue how this map works, how do I decide which
> algorithm to use? What I like with the current maps is that they don't
> leak too much of their inner state. These controls seems a bit low
> level?

The idea to let users to select an algorithm appeared because the map is pretty
generic, and some algorithms may work better for some cases. You're right that
there shouldn't be need to specify algorithm. (In fact, users can omit the
algorithm flag now, and the default algorithm will be used. I also marked to
myself to setup the right default algorithm, as now this seems to be the brute
force...).