On 22/09/09 11:55, sdf@xxxxxxxxxx wrote:
> On 09/09, Anton Protopopov wrote:
> > On 22/09/08 03:37, sdf@xxxxxxxxxx wrote:
> > > On 09/08, Anton Protopopov wrote:
> > > > On 22/09/07 04:09, sdf@xxxxxxxxxx wrote:
> > > > > [sorry for the noise, got a bounce from the list, resend with the
> > > > > message trimmed]
> > > > >
> > > > > On 09/07, Anton Protopopov wrote:
> > > > > > Add a new map, BPF_MAP_TYPE_WILDCARD, which provides means to
> > > > > > implement generic online packet classification. Here "online"
> > > > > > stands for "fast lookups and fast updates", and "generic" means
> > > > > > that a user can create maps with custom lookup schemes: different
> > > > > > numbers of fields and different interpretations of individual
> > > > > > fields (prefix bitmasks, ranges, and direct matches).
> > > > > >
> > > > > > In particular, in Cilium we have several use cases for such a map:
> > > > > >
> > > > > > * XDP Prefilter is a simple XDP-based DDoS mitigation system
> > > > > >   provided by Cilium. At the moment it only supports filtering by
> > > > > >   source CIDRs. It would benefit from using this new map, as it
> > > > > >   allows wildcard rules to be used without a big penalty compared
> > > > > >   to the single hash or LPM lookup used now.
> > > > > >
> > > > > > * XDP Packet Recorder (see
> > > > > >   https://lpc.events/event/11/contributions/953/)
> > > > > >
> > > > > > * K8S and Cilium Network Policies: as of k8s 1.25 port ranges are
> > > > > >   considered a stable feature, and this map allows implementing
> > > > > >   them easily (and also providing more sophisticated filtering
> > > > > >   for Cilium Network Policies)
> > > > > >
> > > > > > Keys for wildcard maps are defined using the struct wildcard_key
> > > > > > structure. Besides the type field, it contains a data field of
> > > > > > arbitrary size. To tell the map what is contained inside the data
> > > > > > field, two additional structures are used. The first one, struct
> > > > > > wildcard_desc, also of arbitrary size, tells how many fields are
> > > > > > contained inside data, and the struct wildcard_rule_desc structure
> > > > > > defines what individual fields look like.
> > > > > >
> > > > > > Fields (rules) can be of three types:
> > > > > > BPF_WILDCARD_RULE_{PREFIX,RANGE,MATCH}.
> > > > >
> > > > > [..]
> > > > >
> > > > > > The PREFIX rule means that inside data we have a binary value and
> > > > > > a binary (prefix) mask:
> > > > > >
> > > > > >           size               u32
> > > > > >     <----------------> <----------->
> > > > > > ... |    rule value   |   prefix   | ...
> > > > > >
> > > > > > Here the rule value is a binary value, e.g., 123.234.128.0, and
> > > > > > prefix is a u32 variable of which only the lower 8 bits are used.
> > > > > > We allow 8, 16, 32, 64, and 128 bit values for PREFIX rules.
> > > > >
> > > > > I haven't looked at the code, so pardon stupid questions. This
> > > > > sounds like an optimized LPM-trie?
> > > > >
> > > > > If not, what's the difference?
> > > >
> > > > First, this map provides a more generic API than LPM. Namely, we
> > > > allow not one, but multiple prefixes (say, to specify both a source
> > > > and a destination prefix), and also not only prefixes, but ranges.
> > > > See the example I've just posted in reply to Daniel, but in short,
> > > > we can specify rules like
> > > >
> > > >     (192.68.0.0/16, 10.0.0.0/24, *, 22)
> > > >
> > > > We are also not limited to 4-tuples, we can create any combination
> > > > of rules.
> > >
> > > Ah, ok, that seems useful. But fundamentally looks like a
> > > map-in-a-map-in-a-map-in-a-map? So, in theory, at least the ip prefix
> > > lookups already can be implemented with LPM (albeit slower)?
> >
> > Well, this is just a single map :)
> >
> > We have a use case for one-prefix lookups in Cilium. The XDP prefilter
> > uses either a hash or an LPM (hash is used if only precise match rules
> > are specified, and LPM if there are CIDR rules):
> >
> > https://github.com/cilium/cilium/blob/master/bpf/bpf_xdp.c#L60-L68
> >
> > Here the new map works faster than LPM (e.g., for 5000K rules this is
> > about 120ns vs. 80ns for an [IPv6] map lookup on AMD Ryzen 9). With the
> > new map we will also be able to specify more sophisticated rules for
> > the XDP prefilter, not only source CIDRs.
>
> Ack, I'm just trying to understand whether decomposing this into smaller
> units might be beneficial. I don't think I have an answer, only asking
> questions :-)

Sure, thanks for the questions!

For the smaller pieces, it is possible to configure just a "one-range" map
or a "one-prefix" map. E.g., in Cilium we have a use case where we need a
"(u32 ID, u16 range)" map (for security id + port range).
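For illustration, such a "one-range" key could be declared with the
two-field variant of the BPF_WILDCARD_DESC_* helper macros described
further down in this thread. The expansion shown here is only a sketch by
analogy with the 4-tuple example from the patch, not the exact generated
code:

    BPF_WILDCARD_DESC_2(
            sec_id_port,
            BPF_WILDCARD_RULE_MATCH, __u32, sec_id,
            BPF_WILDCARD_RULE_RANGE, __u16, port
    );

    /* expected to expand to roughly the following key type */
    struct sec_id_port_key {
            __u32 type;       /* BPF_WILDCARD_KEY_RULE or BPF_WILDCARD_KEY_ELEM */
            __u32 priority;
            union {
                    struct {
                            __u32 sec_id;     /* MATCH: value only */
                            __u16 port_min;   /* RANGE: [min, max] */
                            __u16 port_max;
                    } __packed rule;
                    struct {
                            __u32 sec_id;
                            __u16 port;
                    } __packed;
            };
    } __packed;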
> > > > Second, the [tuple merge implementation of this] map uses hash
> > > > tables to do lookups, not tries.
> > > >
> > > > I also should have mentioned that I have a talk on Plumbers next
> > > > Tuesday about this map; I will talk about the API and algorithms
> > > > there, and provide some numbers. See
> > > > https://lpc.events/event/16/contributions/1356/
> > >
> > > Awesome, hopefully I'll be there as well, will come say hi :-)
> >
> > Unfortunately, I will be remote this time
>
> Still looking forward to the talk! Hopefully you'll get some useful
> feedback out of it.
>
> > > From that link, regarding the following:
> > >
> > >   Thus, both of the algorithms are [nearly?] impossible to implement
> > >   in "pure" BPF due to lack of functionality and also due to verifier
> > >   complexity limits.
> > >
> > > There is a natural question here: what's missing? Can we extend the
> > > bpf machinery to lift some of these restrictions? Verifier complexity
> > > limits are pretty different now compared to even 2 years ago..
> >
> > The complexity, in fact, is a big concern for us in Cilium - we already
> > have rather large programs. Adding at least a few hundred lines vs.
> > just doing a single map lookup is a big deal. (We also have to support
> > older kernels.)
> >
> > The Tuple Merge implementation uses a [custom] hash table underneath.
> > The difference from a normal hash table is that we actually have
> > multiple hash tables inside one - we split rules into several "tables"
> > and do a hash lookup for each table (and, to resolve collisions, hash
> > table elements know which table they belong to).
>
> Agreed, the tradeoff is basically:
>
> - solve your "narrow" usecase
> - solve a bigger problem that will allow you to solve your usecase as
>   well (say, write all those algorithms in bpf; that won't work for the
>   older kernels, but it will lessen the maintenance burden on the
>   kernel)
>
> I'm not sure where that balance is in this particular case. Maybe let's
> get back to discussion after the LPC, hopefully I'll get more context.
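To make the "multiple hash tables inside one" idea above more concrete,
here is a purely conceptual user-space sketch (not the kernel
implementation; a linear scan stands in for the per-table hash lookup only
to keep it self-contained): rules are partitioned into tables by mask, the
packet key is masked once per table, and the best-priority match wins.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct rule {
            uint32_t saddr;       /* already masked to the table's prefix */
            uint32_t priority;    /* lower value wins, by convention of this sketch */
    };

    struct table {
            uint32_t prefix_len;  /* every rule in this table has this prefix length */
            const struct rule *rules;
            size_t n_rules;
    };

    static uint32_t apply_prefix(uint32_t addr, uint32_t plen)
    {
            uint32_t mask = plen ? ~UINT32_C(0) << (32 - plen) : 0;

            return addr & mask;
    }

    /* Classify one field (saddr) against a set of per-prefix tables. */
    static const struct rule *classify(const struct table *tabs, size_t n_tabs,
                                       uint32_t saddr)
    {
            const struct rule *best = NULL;

            for (size_t t = 0; t < n_tabs; t++) {
                    uint32_t key = apply_prefix(saddr, tabs[t].prefix_len);

                    /* stand-in for one hash lookup into this table */
                    for (size_t i = 0; i < tabs[t].n_rules; i++) {
                            const struct rule *r = &tabs[t].rules[i];

                            if (r->saddr == key &&
                                (!best || r->priority < best->priority))
                                    best = r;
                    }
            }
            return best;
    }

    int main(void)
    {
            const struct rule r16[] = { { 0xc0a80000, 20 } }; /* 192.168.0.0/16 */
            const struct rule r24[] = { { 0xc0a80100, 10 } }; /* 192.168.1.0/24 */
            const struct table tabs[] = {
                    { 16, r16, 1 },
                    { 24, r24, 1 },
            };
            const struct rule *m = classify(tabs, 2, 0xc0a80105); /* 192.168.1.5 */

            printf("matched priority %u\n", m ? m->priority : 0);
            return 0;
    }

The point is only that a lookup is a bounded number of hash probes (one per
table) rather than a trie walk; the real TupleMerge algorithm additionally
decides how rules are grouped into tables across all fields.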
> > > > > If yes, can this be special cased/optimized in the existing
> > > > > LPM-trie implementation? I think we've tried in the past, mostly
> > > > > gave up and "fixed" it by caching the state in the socket local
> > > > > storage.
> > > > >
> > > > > Also, fixed 8/16/32/64 prefixes seem like a big limitation? At
> > > > > least if I were to store ipv4 from the classless (cidr) world..
> > > >
> > > > This is not for prefixes, but for the field lengths. The basic use
> > > > case for this map is to filter packets, so we can create 4- or
> > > > 5-tuple IPv4 or IPv6 maps. In the first case our field lengths will
> > > > be (32, 32, 16, 16), in the second case (128, 128, 16, 16). Prefixes
> > > > can be of any length from 0 up to the field size.
> > >
> > > Understood, thx.
> > >
> > > > > > The RANGE rule is determined by two binary values, minimum and
> > > > > > maximum, treated as unsigned integers of the appropriate size:
> > > > > >
> > > > > >           size               size
> > > > > >     <----------------> <---------------->
> > > > > > ... |  min rule value | max rule value  | ...
> > > > > >
> > > > > > We only allow 8, 16, 32, and 64-bit values for RANGE rules.
> > > > >
> > > > > That seems useful. I was thinking about a similar 'rangemap' where
> > > > > we can effectively store and look up [a,b] ranges. Might be useful
> > > > > in firewalls for storing lists of ports efficiently. So why not a
> > > > > separate map instead? Are you able to share a lot of code with the
> > > > > general matching map?
> > > >
> > > > Yes, everything is included inside one map. For firewalls, users can
> > > > create individual maps with different specifications. Say, one can
> > > > filter 4-tuples, or 5-tuples, or (src, dst, dst port), or
> > > > (flow-label, dest port), etc.
> > > >
> > > > > > The MATCH rule is determined by one binary value, and is
> > > > > > basically the same as a (X, sizeof(X)*8) PREFIX rule, but can be
> > > > > > processed a bit faster:
> > > > > >
> > > > > >           size
> > > > > >     <---------------->
> > > > > > ... |    rule value   | ...
> > > > > >
> > > > > > To speed up processing, all rules, including the prefix field,
> > > > > > should be stored in host byte order, and all elements in network
> > > > > > byte order. 16-byte fields are stored as {lo,hi}: lower eight
> > > > > > bytes, then higher eight bytes.
> > > > > >
> > > > > > For elements only values are stored.
> > > > >
> > > > > Can these optimizations be auto-magically derived whenever a
> > > > > PREFIX map with the right values is created? Why put this decision
> > > > > on the users?
> > > >
> > > > Thanks for pointing this out. I am not really proud of this
> > > > particular interface...
> > > >
> > > > For packet filtering, values tend to appear in network byte order,
> > > > so we can just assume this. We can definitely convert rule values to
> > > > the right order internally when bpf_map_update_elem is called.
> > > >
> > > > We also can add a map flag to process values in host byte order.
> > > > This can be helpful for non-networking usage, when values appear in
> > > > host byte order naturally.
> > >
> > > Yeah, it seems like the less uapi we export - the better. Maybe some
> > > of that endianness can be deduced via BTF (be32 vs u32) so we can
> > > transparently apply these optimizations?
> >
> > I am not sure how to do this automatically. Say, users can define an
> > IPv4 address as __u32, or __u8[4], or __be32. And they can also legally
> > mix the definitions, as bpf(2) only says "gimme a void *key".
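If the BTF-based deduction sdf suggests next were used, the declaration
side could in principle carry the endianness explicitly. Whether the
BPF_WILDCARD_DESC_* macros (quoted below) accept __be32/__be16 and whether
that distinction survives into the key's BTF is an assumption here, not
something the patch states:

    /* hypothetical: same 4-tuple key, but with explicit endianness types */
    BPF_WILDCARD_DESC_4(
            capture4_wcard,
            BPF_WILDCARD_RULE_PREFIX, __be32, saddr,
            BPF_WILDCARD_RULE_PREFIX, __be32, daddr,
            BPF_WILDCARD_RULE_RANGE,  __be16, sport,
            BPF_WILDCARD_RULE_RANGE,  __be16, dport
    );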
> You don't have to solve it for everything:
> __u32 and __u8[4] won't get those optimizations, but __be32/__be8[4] will.
> (assuming the endianness signal can be taken from the key's BTF if it
> preserves the __be32 vs __u32 distinction).
>
> And then, somewhere in the documentation, you can say: the preferred way
> to store ip addresses is __be32; this might (or may not) deliver better
> performance in some cases.
>
> This way, we are free to add more heuristics without having a written
> contract with the users about this extra mode. At some point, we might
> figure out how to make this work more efficiently for regular __u32 as
> well.
>
> (pardon me if I'm spilling some nonsense, maybe we should really get
> back to this after LPC)

Yes, in the best case I will get some feedback about use cases other
people have.

> > Users don't have problems with specifying, say, LPM keys or socket
> > options in network byte order, so can we just say: use network byte
> > order for your values?
> >
> > > > > > To simplify the definition of key structures, the
> > > > > > BPF_WILDCARD_DESC_{1,2,3,4,5} macros should be used. For
> > > > > > example, one can define IPv4 4-tuple keys as follows:
> > > > > >
> > > > > >     BPF_WILDCARD_DESC_4(
> > > > > >             capture4_wcard,
> > > > > >             BPF_WILDCARD_RULE_PREFIX, __u32, saddr,
> > > > > >             BPF_WILDCARD_RULE_PREFIX, __u32, daddr,
> > > > > >             BPF_WILDCARD_RULE_RANGE, __u16, sport,
> > > > > >             BPF_WILDCARD_RULE_RANGE, __u16, dport
> > > > > >     );
> > > > > >
> > > > > > This macro will define the following structure:
> > > > > >
> > > > > >     struct capture4_wcard_key {
> > > > > >             __u32 type;
> > > > > >             __u32 priority;
> > > > > >             union {
> > > > > >                     struct {
> > > > > >                             __u32 saddr;
> > > > > >                             __u32 saddr_prefix;
> > > > > >                             __u32 daddr;
> > > > > >                             __u32 daddr_prefix;
> > > > > >                             __u16 sport_min;
> > > > > >                             __u16 sport_max;
> > > > > >                             __u16 dport_min;
> > > > > >                             __u16 dport_max;
> > > > > >                     } __packed rule;
> > > > > >                     struct {
> > > > > >                             __u32 saddr;
> > > > > >                             __u32 daddr;
> > > > > >                             __u16 sport;
> > > > > >                             __u16 dport;
> > > > > >                     } __packed;
> > > > > >             };
> > > > > >     } __packed;
> > > > > >
> > > > > > Here the type field should contain either BPF_WILDCARD_KEY_RULE
> > > > > > or BPF_WILDCARD_KEY_ELEM so that the kernel can differentiate
> > > > > > between rules and elements. The rule structure is used to define
> > > > > > (and look up) rules; the unnamed structure can be used to
> > > > > > specify elements when matching them against rules.
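To make the rule/element distinction above concrete, here is a rough
user-space sketch of how such a key might be filled, assuming the
byte-order convention quoted earlier (rules in host byte order, elements
in network byte order) and the usual bpf_map_update_elem()/
bpf_map_lookup_elem() calling convention. The value type, flags, and map
fd are placeholders; treat this only as an illustration of the key layout,
not as the patch's documented API:

    #include <bpf/bpf.h>

    /* Insert the rule (192.68.0.0/16, 10.0.0.0/24, *, 22) with priority 10. */
    static int add_rule(int map_fd)
    {
            struct capture4_wcard_key key = {
                    .type = BPF_WILDCARD_KEY_RULE,
                    .priority = 10,
                    .rule = {
                            .saddr = 0xc0440000,   /* 192.68.0.0, host byte order */
                            .saddr_prefix = 16,
                            .daddr = 0x0a000000,   /* 10.0.0.0, host byte order */
                            .daddr_prefix = 24,
                            .sport_min = 0,        /* "*": full port range */
                            .sport_max = 0xffff,
                            .dport_min = 22,
                            .dport_max = 22,
                    },
            };
            __u32 verdict = 1;                     /* placeholder value type */

            return bpf_map_update_elem(map_fd, &key, &verdict, BPF_ANY);
    }

    /* Classify one packet: fill the element view, values in network byte order. */
    static int classify_pkt(int map_fd, __u32 saddr_be, __u32 daddr_be,
                            __u16 sport_be, __u16 dport_be, __u32 *verdict)
    {
            struct capture4_wcard_key key = {
                    .type  = BPF_WILDCARD_KEY_ELEM,
                    .saddr = saddr_be,
                    .daddr = daddr_be,
                    .sport = sport_be,
                    .dport = dport_be,
            };

            return bpf_map_lookup_elem(map_fd, &key, verdict);
    }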
> > > > > > In order to simplify the definition of a corresponding struct
> > > > > > wildcard_desc, the BPF_WILDCARD_DESC_* macros will create yet
> > > > > > another structure:
> > > > > >
> > > > > >     struct capture4_wcard_desc {
> > > > > >             __uint(n_rules, 4);
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_PREFIX);
> > > > > >                     __uint(size, sizeof(__u32));
> > > > > >             } saddr;
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_PREFIX);
> > > > > >                     __uint(size, sizeof(__u32));
> > > > > >             } daddr;
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_RANGE);
> > > > > >                     __uint(size, sizeof(__u16));
> > > > > >             } sport;
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_RANGE);
> > > > > >                     __uint(size, sizeof(__u16));
> > > > > >             } dport;
> > > > > >     };
> > > > > >
> > > > > > This structure can be used in a (BTF) map definition as follows:
> > > > > >
> > > > > >     __type(wildcard_desc, struct capture4_wcard_desc);
> > > > > >
> > > > > > Then libbpf will create a corresponding struct wildcard_desc and
> > > > > > pass it to the kernel in bpf_attr using the new
> > > > > > map_extra_data/map_extra_data_size fields.
> > > > >
> > > > > [..]
> > > > >
> > > > > > The map implementation allows users to specify which algorithm
> > > > > > to use to store rules and look up packets. Currently, three
> > > > > > algorithms are supported:
> > > > > >
> > > > > > * Brute Force (suitable for map sizes of about 32 elements or
> > > > > >   fewer)
> > > > > >
> > > > > > * Tuple Merge (a variant of the Tuple Merge algorithm described
> > > > > >   in the "TupleMerge: Fast Software Packet Processing for Online
> > > > > >   Packet Classification" white paper, see
> > > > > >   https://nonsns.github.io/paper/rossi19ton.pdf. The Tuple Merge
> > > > > >   algorithm is not protected by any patents.)
> > > > > >
> > > > > > * Static Tuple Merge (a variant of Tuple Merge where the set of
> > > > > >   lookup tables is directly provided by the user)
> > > > >
> > > > > As a user that has no clue how this map works, how do I decide
> > > > > which algorithm to use? What I like with the current maps is that
> > > > > they don't leak too much of their inner state. These controls seem
> > > > > a bit low level?
> > > >
> > > > The idea of letting users select an algorithm appeared because the
> > > > map is pretty generic, and some algorithms may work better for some
> > > > cases. You're right that there shouldn't be a need to specify an
> > > > algorithm. (In fact, users can omit the algorithm flag now, and the
> > > > default algorithm will be used. I also made a note to myself to set
> > > > up the right default algorithm, as now it seems to be the brute
> > > > force...)
> > >
> > > The reason I ask is: it seems it's just a matter of time until there
> > > is an algorithm called BPF_WILDCARD_F_ALGORITHM_BPF where you can
> > > pass a BPF program that implements a custom one :-)
> >
> > ... which fits the current implementation :)
> >
> > If we don't let users specify an algorithm, then do we let them pass
> > options for a particular algorithm? For example, if for a particular
> > setup an option can change the lookup time by, say, a factor of two
> > compared to the default, is it worth providing such a toggle? If we let
> > users provide options, then once we change the default algorithm, those
> > options most probably won't have meaning for the new one.
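As a side note, with the pieces quoted above a complete BTF map definition
would presumably look roughly like the following. The key/value wiring and
max_entries are placeholders, and how exactly an algorithm flag would be
spelled is left open here, so this is only a sketch of the intended usage:

    /* BPF object (.c) side; assumes the patch's uapi definitions and the
     * BPF_WILDCARD_DESC_4(capture4_wcard, ...) declaration quoted above. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_WILDCARD);
            __uint(max_entries, 1024);                  /* placeholder */
            __type(key, struct capture4_wcard_key);
            __type(value, __u32);                       /* placeholder verdict */
            __type(wildcard_desc, struct capture4_wcard_desc);
            /* an algorithm-selection flag, if any, would presumably be a
             * map_flags bit */
    } capture4 SEC(".maps");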
> > We actually can do both things here. For a "default" algorithm we
> > won't let users specify any non-generic, algorithm-specific options.
> > But if a user specifies a particular algorithm, then setting
> > algorithm-specific options will be allowed.
>
> Right, but again, it might be useful to do more research on what's
> missing from the generic bpf machinery to support your fast wildcard
> lookups by letting you implement some of these algorithms in bpf itself.
>
> Because the result might be one of:
>
> 1. There are x, y, z technical problems which will take years to solve;
>    let's do this wildcard map for now; as bpf gains more and more
>    features it will become obsolete or gain that
>    BPF_WILDCARD_F_ALGORITHM_BPF for custom algos.
>
> Or
>
> 2. Looks like we can add this and that to the verifier in some
>    reasonable amount of time and then we won't need a new map.
>
> I'm also somewhat leaning towards (1), but I'm still questioning your
> statement about 'these algorithms are [nearly?] impossible to implement
> in "pure" bpf'. Given enough effort this 'impossible' will at some point
> become possible. The bpf landscape is very different today compared to
> yesterday.

The current implementation provides a good generic solution (the Tuple
Merge algorithm works pretty well, even without any additional
configuration). It also provides a way to experiment with implementing a
similar or better algorithm in BPF (aka BPF_WILDCARD_F_ALGORITHM_BPF).

When I wrote the "nearly impossible" phrase, I meant, first of all,
embedding the classifier inside existing programs.
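To illustrate what embedding the classifier amounts to on the program side,
assuming the 'capture4' map and key definitions from the sketches above, an
XDP prefilter doing a single wildcard lookup per packet might look roughly
like this (header parsing is simplified, IP options are skipped, and the
verdict semantics are a placeholder):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_prefilter(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;
            struct iphdr *iph;
            struct udphdr *udph;
            __u32 *verdict;

            if ((void *)(eth + 1) > data_end)
                    return XDP_PASS;
            if (eth->h_proto != bpf_htons(ETH_P_IP))
                    return XDP_PASS;

            iph = (void *)(eth + 1);
            if ((void *)(iph + 1) > data_end)
                    return XDP_PASS;
            if (iph->ihl != 5 || iph->protocol != IPPROTO_UDP)
                    return XDP_PASS;

            udph = (void *)(iph + 1);
            if ((void *)(udph + 1) > data_end)
                    return XDP_PASS;

            /* element view of the key: all values in network byte order */
            struct capture4_wcard_key key = {
                    .type  = BPF_WILDCARD_KEY_ELEM,
                    .saddr = iph->saddr,
                    .daddr = iph->daddr,
                    .sport = udph->source,
                    .dport = udph->dest,
            };

            verdict = bpf_map_lookup_elem(&capture4, &key);
            if (verdict && *verdict)
                    return XDP_DROP;

            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";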