On 22/09/09 11:55, sdf@xxxxxxxxxx wrote:
> On 09/09, Anton Protopopov wrote:
> > On 22/09/08 03:37, sdf@xxxxxxxxxx wrote:
> > > On 09/08, Anton Protopopov wrote:
> > > > On 22/09/07 04:09, sdf@xxxxxxxxxx wrote:
> > > > > [sorry for the noise, got a bounce from the list, resend with the
> > > > > message trimmed]
> > > > >
> > > > > On 09/07, Anton Protopopov wrote:
> > > > > > Add a new map, BPF_MAP_TYPE_WILDCARD, which provides means to
> > > > > > implement generic online packet classification. Here "online"
> > > > > > stands for "fast lookups and fast updates", and "generic" means
> > > > > > that a user can create maps with custom lookup schemes: different
> > > > > > numbers of fields and different interpretations of individual
> > > > > > fields (prefix bitmasks, ranges, and direct matches).
> > > > > >
> > > > > > In particular, in Cilium we have several use cases for such a map:
> > > > > >
> > > > > > * XDP Prefilter is a simple XDP-based DDoS mitigation system
> > > > > >   provided by Cilium. At the moment it only supports filtering by
> > > > > >   source CIDRs. It would benefit from using this new map, as it
> > > > > >   allows wildcard rules to be used without a big penalty compared
> > > > > >   to the single hash or LPM lookup used now.
> > > > > >
> > > > > > * XDP Packet Recorder (see
> > > > > >   https://lpc.events/event/11/contributions/953/)
> > > > > >
> > > > > > * K8S and Cilium Network Policies: as of k8s 1.25 port ranges are
> > > > > >   considered a stable feature, and this map allows implementing
> > > > > >   them easily (and also providing more sophisticated filtering
> > > > > >   for Cilium Network Policies)
> > > > > >
> > > > > > Keys for wildcard maps are defined using the struct wildcard_key
> > > > > > structure. Besides the type field, it contains a data field of
> > > > > > arbitrary size. To tell the map what is contained inside the data
> > > > > > field, two additional structures are used. The first one, struct
> > > > > > wildcard_desc, also of arbitrary size, tells how many fields are
> > > > > > contained inside data, and the struct wildcard_rule_desc structure
> > > > > > defines what individual fields look like.
> > > > > >
> > > > > > Fields (rules) can be of three types:
> > > > > > BPF_WILDCARD_RULE_{PREFIX,RANGE,MATCH}.
> > > > >
> > > > > [..]
> > > > >
> > > > > > The PREFIX rule means that inside data we have a binary value and
> > > > > > a binary (prefix) mask:
> > > > > >
> > > > > >           size               u32
> > > > > >     <----------------> <----------->
> > > > > > ... |    rule value   |   prefix   | ...
> > > > > >
> > > > > > Here the rule value is a binary value, e.g., 123.234.128.0, and
> > > > > > prefix is a u32 variable of which only the lower 8 bits are used.
> > > > > > We allow 8, 16, 32, 64, and 128 bit values for PREFIX rules.
> > > > >
> > > > > I haven't looked at the code, so pardon stupid questions. This
> > > > > sounds like an optimized LPM-trie?
> > > > >
> > > > > If not, what's the difference?
> > > >
> > > > First, this map provides a more generic API than LPM. Namely, we
> > > > allow not one, but multiple prefixes (say, to specify both a source
> > > > and a destination prefix), and also not only prefixes, but ranges.
> > > > See the example I've just posted in reply to Daniel, but in short,
> > > > we can specify rules like
> > > >
> > > >     (192.68.0.0/16, 10.0.0.0/24, *, 22)
> > > >
> > > > We are also not limited to 4-tuples, we can create any combination
> > > > of rules.
> > >
> > > Ah, ok, that seems useful. But fundamentally looks like a
> > > map-in-a-map-in-a-map-in-a-map? So, in theory, at least the ip prefix
> > > lookups already can be implemented with LPM (albeit slower)?
> >
> > Well, this is just a single map :)
> >
> > We have a use case for one-prefix lookups in Cilium. The XDP prefilter
> > uses either a hash or an LPM (hash is used if only precise match rules
> > are specified, and LPM if there are CIDR rules):
> >
> > https://github.com/cilium/cilium/blob/master/bpf/bpf_xdp.c#L60-L68
> >
> > Here the new map works faster than LPM (e.g., for 5000K rules this is
> > about 120ns vs. 80ns for an [IPv6] map lookup on AMD Ryzen 9). With the
> > new map we will also be able to specify more sophisticated rules for
> > the XDP prefilter, not only source CIDRs.
>
> Ack, I'm just trying to understand whether decomposing this into smaller
> units might be beneficial. I don't think I have an answer, only asking
> questions :-)

Sure, thanks for the questions!

For the smaller pieces, it is possible to configure just a "one-range" map
or a "one-prefix" map. E.g., in Cilium we have a use case where we need a
"(u32 ID, u16 range)" map (for security id + port range).
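For illustration, such a "one-range" key could be declared with the
two-field variant of the BPF_WILDCARD_DESC_* helper macros described
further down in this thread. The expansion shown here is only a sketch by
analogy with the 4-tuple example from the patch, not the exact generated
code:

    BPF_WILDCARD_DESC_2(
            sec_id_port,
            BPF_WILDCARD_RULE_MATCH, __u32, sec_id,
            BPF_WILDCARD_RULE_RANGE, __u16, port
    );

    /* expected to expand to roughly the following key type */
    struct sec_id_port_key {
            __u32 type;       /* BPF_WILDCARD_KEY_RULE or BPF_WILDCARD_KEY_ELEM */
            __u32 priority;
            union {
                    struct {
                            __u32 sec_id;     /* MATCH: value only */
                            __u16 port_min;   /* RANGE: [min, max] */
                            __u16 port_max;
                    } __packed rule;
                    struct {
                            __u32 sec_id;
                            __u16 port;
                    } __packed;
            };
    } __packed;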
> > > > Second, the [tuple merge implementation of this] map uses hash
> > > > tables to do lookups, not tries.
> > > >
> > > > I also should have mentioned that I have a talk on Plumbers next
> > > > Tuesday about this map; I will talk about the API and algorithms
> > > > there, and provide some numbers. See
> > > > https://lpc.events/event/16/contributions/1356/
> > >
> > > Awesome, hopefully I'll be there as well, will come say hi :-)
> >
> > Unfortunately, I will be remote this time
>
> Still looking forward to the talk! Hopefully you'll get some useful
> feedback out of it.
>
> > > From that link, regarding the following:
> > >
> > >   Thus, both of the algorithms are [nearly?] impossible to implement
> > >   in "pure" BPF due to lack of functionality and also due to verifier
> > >   complexity limits.
> > >
> > > There is a natural question here: what's missing? Can we extend the
> > > bpf machinery to lift some of these restrictions? Verifier complexity
> > > limits are pretty different now compared to even 2 years ago..
> >
> > The complexity, in fact, is a big concern for us in Cilium - we already
> > have rather large programs. Adding at least a few hundred lines vs.
> > just doing a single map lookup is a big deal. (We also have to support
> > older kernels.)
> >
> > The Tuple Merge implementation uses a [custom] hash table underneath.
> > The difference from a normal hash table is that we actually have
> > multiple hash tables inside one - we split rules into several "tables"
> > and do a hash lookup for each table (and, to resolve collisions, hash
> > table elements know which table they belong to).
>
> Agreed, the tradeoff is basically:
>
> - solve your "narrow" usecase
> - solve a bigger problem that will allow you to solve your usecase as
>   well (say, write all those algorithms in bpf; that won't work for the
>   older kernels, but it will lessen the maintenance burden on the
>   kernel)
>
> I'm not sure where that balance is in this particular case. Maybe let's
> get back to discussion after the LPC, hopefully I'll get more context.
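To make the "multiple hash tables inside one" idea above more concrete,
here is a purely conceptual user-space sketch (not the kernel
implementation; a linear scan stands in for the per-table hash lookup only
to keep it self-contained): rules are partitioned into tables by mask, the
packet key is masked once per table, and the best-priority match wins.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct rule {
            uint32_t saddr;       /* already masked to the table's prefix */
            uint32_t priority;    /* lower value wins, by convention of this sketch */
    };

    struct table {
            uint32_t prefix_len;  /* every rule in this table has this prefix length */
            const struct rule *rules;
            size_t n_rules;
    };

    static uint32_t apply_prefix(uint32_t addr, uint32_t plen)
    {
            uint32_t mask = plen ? ~UINT32_C(0) << (32 - plen) : 0;

            return addr & mask;
    }

    /* Classify one field (saddr) against a set of per-prefix tables. */
    static const struct rule *classify(const struct table *tabs, size_t n_tabs,
                                       uint32_t saddr)
    {
            const struct rule *best = NULL;

            for (size_t t = 0; t < n_tabs; t++) {
                    uint32_t key = apply_prefix(saddr, tabs[t].prefix_len);

                    /* stand-in for one hash lookup into this table */
                    for (size_t i = 0; i < tabs[t].n_rules; i++) {
                            const struct rule *r = &tabs[t].rules[i];

                            if (r->saddr == key &&
                                (!best || r->priority < best->priority))
                                    best = r;
                    }
            }
            return best;
    }

    int main(void)
    {
            const struct rule r16[] = { { 0xc0a80000, 20 } }; /* 192.168.0.0/16 */
            const struct rule r24[] = { { 0xc0a80100, 10 } }; /* 192.168.1.0/24 */
            const struct table tabs[] = {
                    { 16, r16, 1 },
                    { 24, r24, 1 },
            };
            const struct rule *m = classify(tabs, 2, 0xc0a80105); /* 192.168.1.5 */

            printf("matched priority %u\n", m ? m->priority : 0);
            return 0;
    }

The point is only that a lookup is a bounded number of hash probes (one per
table) rather than a trie walk; the real TupleMerge algorithm additionally
decides how rules are grouped into tables across all fields.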
> > > > > If yes, can this be special cased/optimized in the existing
> > > > > LPM-trie implementation? I think we've tried in the past, mostly
> > > > > gave up and "fixed" it by caching the state in the socket local
> > > > > storage.
> > > > >
> > > > > Also, fixed 8/16/32/64 prefixes seem like a big limitation? At
> > > > > least if I were to store ipv4 from the classless (cidr) world..
> > > >
> > > > This is not for prefixes, but for the field lengths. The basic use
> > > > case for this map is to filter packets, so we can create 4- or
> > > > 5-tuple IPv4 or IPv6 maps. In the first case our field lengths will
> > > > be (32, 32, 16, 16), in the second case (128, 128, 16, 16). Prefixes
> > > > can be of any length from 0 up to the field size.
> > >
> > > Understood, thx.
> > >
> > > > > > The RANGE rule is determined by two binary values, minimum and
> > > > > > maximum, treated as unsigned integers of the appropriate size:
> > > > > >
> > > > > >           size               size
> > > > > >     <----------------> <---------------->
> > > > > > ... |  min rule value | max rule value  | ...
> > > > > >
> > > > > > We only allow 8, 16, 32, and 64-bit values for RANGE rules.
> > > > >
> > > > > That seems useful. I was thinking about a similar 'rangemap' where
> > > > > we can effectively store and look up [a,b] ranges. Might be useful
> > > > > in firewalls for storing lists of ports efficiently. So why not a
> > > > > separate map instead? Are you able to share a lot of code with the
> > > > > general matching map?
> > > >
> > > > Yes, everything is included inside one map. For firewalls, users can
> > > > create individual maps with different specifications. Say, one can
> > > > filter 4-tuples, or 5-tuples, or (src, dst, dst port), or
> > > > (flow-label, dest port), etc.
> > > >
> > > > > > The MATCH rule is determined by one binary value, and is
> > > > > > basically the same as a (X, sizeof(X)*8) PREFIX rule, but can be
> > > > > > processed a bit faster:
> > > > > >
> > > > > >           size
> > > > > >     <---------------->
> > > > > > ... |    rule value   | ...
> > > > > >
> > > > > > To speed up processing, all rules, including the prefix field,
> > > > > > should be stored in host byte order, and all elements in network
> > > > > > byte order. 16-byte fields are stored as {lo,hi}: lower eight
> > > > > > bytes, then higher eight bytes.
> > > > > >
> > > > > > For elements only values are stored.
> > > > >
> > > > > Can these optimizations be auto-magically derived whenever a
> > > > > PREFIX map with the right values is created? Why put this decision
> > > > > on the users?
> > > >
> > > > Thanks for pointing this out. I am not really proud of this
> > > > particular interface...
> > > >
> > > > For packet filtering, values tend to appear in network byte order,
> > > > so we can just assume this. We can definitely convert rule values to
> > > > the right order internally when bpf_map_update_elem is called.
> > > >
> > > > We also can add a map flag to process values in host byte order.
> > > > This can be helpful for non-networking usage, when values appear in
> > > > host byte order naturally.
> > >
> > > Yeah, it seems like the less uapi we export - the better. Maybe some
> > > of that endianness can be deduced via BTF (be32 vs u32) so we can
> > > transparently apply these optimizations?
> >
> > I am not sure how to do this automatically. Say, users can define an
> > IPv4 address as __u32, or __u8[4], or __be32. And they can also legally
> > mix the definitions, as bpf(2) only says "gimme a void *key".
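If the BTF-based deduction sdf suggests next were used, the declaration
side could in principle carry the endianness explicitly. Whether the
BPF_WILDCARD_DESC_* macros (quoted below) accept __be32/__be16 and whether
that distinction survives into the key's BTF is an assumption here, not
something the patch states:

    /* hypothetical: same 4-tuple key, but with explicit endianness types */
    BPF_WILDCARD_DESC_4(
            capture4_wcard,
            BPF_WILDCARD_RULE_PREFIX, __be32, saddr,
            BPF_WILDCARD_RULE_PREFIX, __be32, daddr,
            BPF_WILDCARD_RULE_RANGE,  __be16, sport,
            BPF_WILDCARD_RULE_RANGE,  __be16, dport
    );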
> You don't have to solve it for everything:
> __u32 and __u8[4] won't get those optimizations, but __be32/__be8[4] will.
> (assuming the endianness signal can be taken from the key's BTF if it
> preserves the __be32 vs __u32 distinction).
>
> And then, somewhere in the documentation, you can say: the preferred way
> to store ip addresses is __be32; this might (or may not) deliver better
> performance in some cases.
>
> This way, we are free to add more heuristics without having a written
> contract with the users about this extra mode. At some point, we might
> figure out how to make this work more efficiently for regular __u32 as
> well.
>
> (pardon me if I'm spilling some nonsense, maybe we should really get
> back to this after LPC)

Yes, in the best case I will get some feedback about use cases other
people have.

> > Users don't have problems with specifying, say, LPM keys or socket
> > options in network byte order, so can we just say: use network byte
> > order for your values?
> >
> > > > > > To simplify the definition of key structures, the
> > > > > > BPF_WILDCARD_DESC_{1,2,3,4,5} macros should be used. For
> > > > > > example, one can define IPv4 4-tuple keys as follows:
> > > > > >
> > > > > >     BPF_WILDCARD_DESC_4(
> > > > > >             capture4_wcard,
> > > > > >             BPF_WILDCARD_RULE_PREFIX, __u32, saddr,
> > > > > >             BPF_WILDCARD_RULE_PREFIX, __u32, daddr,
> > > > > >             BPF_WILDCARD_RULE_RANGE, __u16, sport,
> > > > > >             BPF_WILDCARD_RULE_RANGE, __u16, dport
> > > > > >     );
> > > > > >
> > > > > > This macro will define the following structure:
> > > > > >
> > > > > >     struct capture4_wcard_key {
> > > > > >             __u32 type;
> > > > > >             __u32 priority;
> > > > > >             union {
> > > > > >                     struct {
> > > > > >                             __u32 saddr;
> > > > > >                             __u32 saddr_prefix;
> > > > > >                             __u32 daddr;
> > > > > >                             __u32 daddr_prefix;
> > > > > >                             __u16 sport_min;
> > > > > >                             __u16 sport_max;
> > > > > >                             __u16 dport_min;
> > > > > >                             __u16 dport_max;
> > > > > >                     } __packed rule;
> > > > > >                     struct {
> > > > > >                             __u32 saddr;
> > > > > >                             __u32 daddr;
> > > > > >                             __u16 sport;
> > > > > >                             __u16 dport;
> > > > > >                     } __packed;
> > > > > >             };
> > > > > >     } __packed;
> > > > > >
> > > > > > Here the type field should contain either BPF_WILDCARD_KEY_RULE
> > > > > > or BPF_WILDCARD_KEY_ELEM so that the kernel can differentiate
> > > > > > between rules and elements. The rule structure is used to define
> > > > > > (and look up) rules; the unnamed structure can be used to
> > > > > > specify elements when matching them against rules.
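To make the rule/element distinction above concrete, here is a rough
user-space sketch of how such a key might be filled, assuming the
byte-order convention quoted earlier (rules in host byte order, elements
in network byte order) and the usual bpf_map_update_elem()/
bpf_map_lookup_elem() calling convention. The value type, flags, and map
fd are placeholders; treat this only as an illustration of the key layout,
not as the patch's documented API:

    #include <bpf/bpf.h>

    /* Insert the rule (192.68.0.0/16, 10.0.0.0/24, *, 22) with priority 10. */
    static int add_rule(int map_fd)
    {
            struct capture4_wcard_key key = {
                    .type = BPF_WILDCARD_KEY_RULE,
                    .priority = 10,
                    .rule = {
                            .saddr = 0xc0440000,   /* 192.68.0.0, host byte order */
                            .saddr_prefix = 16,
                            .daddr = 0x0a000000,   /* 10.0.0.0, host byte order */
                            .daddr_prefix = 24,
                            .sport_min = 0,        /* "*": full port range */
                            .sport_max = 0xffff,
                            .dport_min = 22,
                            .dport_max = 22,
                    },
            };
            __u32 verdict = 1;                     /* placeholder value type */

            return bpf_map_update_elem(map_fd, &key, &verdict, BPF_ANY);
    }

    /* Classify one packet: fill the element view, values in network byte order. */
    static int classify_pkt(int map_fd, __u32 saddr_be, __u32 daddr_be,
                            __u16 sport_be, __u16 dport_be, __u32 *verdict)
    {
            struct capture4_wcard_key key = {
                    .type  = BPF_WILDCARD_KEY_ELEM,
                    .saddr = saddr_be,
                    .daddr = daddr_be,
                    .sport = sport_be,
                    .dport = dport_be,
            };

            return bpf_map_lookup_elem(map_fd, &key, verdict);
    }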
> > > > > > In order to simplify the definition of a corresponding struct
> > > > > > wildcard_desc, the BPF_WILDCARD_DESC_* macros will create yet
> > > > > > another structure:
> > > > > >
> > > > > >     struct capture4_wcard_desc {
> > > > > >             __uint(n_rules, 4);
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_PREFIX);
> > > > > >                     __uint(size, sizeof(__u32));
> > > > > >             } saddr;
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_PREFIX);
> > > > > >                     __uint(size, sizeof(__u32));
> > > > > >             } daddr;
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_RANGE);
> > > > > >                     __uint(size, sizeof(__u16));
> > > > > >             } sport;
> > > > > >             struct {
> > > > > >                     __uint(type, BPF_WILDCARD_RULE_RANGE);
> > > > > >                     __uint(size, sizeof(__u16));
> > > > > >             } dport;
> > > > > >     };
> > > > > >
> > > > > > This structure can be used in a (BTF) map definition as follows:
> > > > > >
> > > > > >     __type(wildcard_desc, struct capture4_wcard_desc);
> > > > > >
> > > > > > Then libbpf will create a corresponding struct wildcard_desc and
> > > > > > pass it to the kernel in bpf_attr using the new
> > > > > > map_extra_data/map_extra_data_size fields.
> > > > >
> > > > > [..]
> > > > >
> > > > > > The map implementation allows users to specify which algorithm
> > > > > > to use to store rules and look up packets. Currently, three
> > > > > > algorithms are supported:
> > > > > >
> > > > > > * Brute Force (suitable for map sizes of about 32 elements or
> > > > > >   fewer)
> > > > > >
> > > > > > * Tuple Merge (a variant of the Tuple Merge algorithm described
> > > > > >   in the "TupleMerge: Fast Software Packet Processing for Online
> > > > > >   Packet Classification" white paper, see
> > > > > >   https://nonsns.github.io/paper/rossi19ton.pdf. The Tuple Merge
> > > > > >   algorithm is not protected by any patents.)
> > > > > >
> > > > > > * Static Tuple Merge (a variant of Tuple Merge where the set of
> > > > > >   lookup tables is directly provided by the user)
> > > > >
> > > > > As a user that has no clue how this map works, how do I decide
> > > > > which algorithm to use? What I like with the current maps is that
> > > > > they don't leak too much of their inner state. These controls seem
> > > > > a bit low level?
> > > >
> > > > The idea of letting users select an algorithm appeared because the
> > > > map is pretty generic, and some algorithms may work better for some
> > > > cases. You're right that there shouldn't be a need to specify an
> > > > algorithm. (In fact, users can omit the algorithm flag now, and the
> > > > default algorithm will be used. I also made a note to myself to set
> > > > up the right default algorithm, as now it seems to be the brute
> > > > force...)
> > >
> > > The reason I ask is: it seems it's just a matter of time until there
> > > is an algorithm called BPF_WILDCARD_F_ALGORITHM_BPF where you can
> > > pass a BPF program that implements a custom one :-)
> >
> > ... which fits the current implementation :)
> >
> > If we don't let users specify an algorithm, then do we let them pass
> > options for a particular algorithm? For example, if for a particular
> > setup an option can change the lookup time by, say, a factor of two
> > compared to the default, is it worth providing such a toggle? If we let
> > users provide options, then once we change the default algorithm, those
> > options most probably won't have meaning for the new one.
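As a side note, with the pieces quoted above a complete BTF map definition
would presumably look roughly like the following. The key/value wiring and
max_entries are placeholders, and how exactly an algorithm flag would be
spelled is left open here, so this is only a sketch of the intended usage:

    /* BPF object (.c) side; assumes the patch's uapi definitions and the
     * BPF_WILDCARD_DESC_4(capture4_wcard, ...) declaration quoted above. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_WILDCARD);
            __uint(max_entries, 1024);                  /* placeholder */
            __type(key, struct capture4_wcard_key);
            __type(value, __u32);                       /* placeholder verdict */
            __type(wildcard_desc, struct capture4_wcard_desc);
            /* an algorithm-selection flag, if any, would presumably be a
             * map_flags bit */
    } capture4 SEC(".maps");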
> > We actually can do both things here. For a "default" algorithm we
> > won't let users specify any non-generic, algorithm-specific options.
> > But if a user specifies a particular algorithm, then setting
> > algorithm-specific options will be allowed.
>
> Right, but again, it might be useful to do more research on what's
> missing from the generic bpf machinery to support your fast wildcard
> lookups by letting you implement some of these algorithms in bpf itself.
>
> Because the result might be one of:
>
> 1. There are x, y, z technical problems which will take years to solve;
>    let's do this wildcard map for now; as bpf gains more and more
>    features it will become obsolete or gain that
>    BPF_WILDCARD_F_ALGORITHM_BPF for custom algos.
>
> Or
>
> 2. Looks like we can add this and that to the verifier in some
>    reasonable amount of time and then we won't need a new map.
>
> I'm also somewhat leaning towards (1), but I'm still questioning your
> statement about 'these algorithms are [nearly?] impossible to implement
> in "pure" bpf'. Given enough effort this 'impossible' will at some point
> become possible. The bpf landscape is very different today compared to
> yesterday.

The current implementation provides a good generic solution (the Tuple
Merge algorithm works pretty well, even without any additional
configuration). It also provides a way to experiment with implementing a
similar or better algorithm in BPF (aka BPF_WILDCARD_F_ALGORITHM_BPF).

When I wrote the "nearly impossible" phrase, I meant, first of all,
embedding the classifier inside existing programs.
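To illustrate what embedding the classifier amounts to on the program side,
assuming the 'capture4' map and key definitions from the sketches above, an
XDP prefilter doing a single wildcard lookup per packet might look roughly
like this (header parsing is simplified, IP options are skipped, and the
verdict semantics are a placeholder):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_prefilter(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;
            struct ethhdr *eth = data;
            struct iphdr *iph;
            struct udphdr *udph;
            __u32 *verdict;

            if ((void *)(eth + 1) > data_end)
                    return XDP_PASS;
            if (eth->h_proto != bpf_htons(ETH_P_IP))
                    return XDP_PASS;

            iph = (void *)(eth + 1);
            if ((void *)(iph + 1) > data_end)
                    return XDP_PASS;
            if (iph->ihl != 5 || iph->protocol != IPPROTO_UDP)
                    return XDP_PASS;

            udph = (void *)(iph + 1);
            if ((void *)(udph + 1) > data_end)
                    return XDP_PASS;

            /* element view of the key: all values in network byte order */
            struct capture4_wcard_key key = {
                    .type  = BPF_WILDCARD_KEY_ELEM,
                    .saddr = iph->saddr,
                    .daddr = iph->daddr,
                    .sport = udph->source,
                    .dport = udph->dest,
            };

            verdict = bpf_map_lookup_elem(&capture4, &key);
            if (verdict && *verdict)
                    return XDP_DROP;

            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";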