On 09/08, Anton Protopopov wrote:
On 22/09/07 04:09, sdf@xxxxxxxxxx wrote:
> [sorry for the noise, got a bounce from the list, resend with the
> message trimmed]
>
> On 09/07, Anton Protopopov wrote:
> > Add a new map, BPF_MAP_TYPE_WILDCARD, which provides means to implement
> > generic online packet classification. Here "online" stands for "fast
> > lookups and fast updates", and "generic" means that a user can create
> > maps with custom lookup schemes: different numbers of fields and
> > different interpretations of individual fields (prefix bitmasks,
> > ranges, and direct matches).
>
> > In particular, in Cilium we have several use cases for such a map:
>
> > * XDP Prefilter is a simple XDP-based DDoS mitigation system provided
> >   by Cilium. At the moment it only supports filtering by source CIDRs.
> >   It would benefit from using this new map, as it allows wildcard
> >   rules to be used without a big penalty compared to the single hash
> >   or LPM lookup used now.
>
> > * XDP Packet Recorder (see
> > https://lpc.events/event/11/contributions/953/)
>
> > * K8S and Cilium Network Policies: as of k8s 1.25 port ranges are
> >   considered a stable feature, and this map allows implementing this
> >   easily (and also provides more sophisticated filtering for Cilium
> >   Network Policies)
>
> > Keys for wildcard maps are defined using the struct wildcard_key
> > structure. Besides the type field, it contains the data field of an
> > arbitrary size. To educate the map about what's contained inside the
> > data field, two additional structures are used. The first one, struct
> > wildcard_desc, also of arbitrary size, tells how many fields are
> > contained inside data, and the struct wildcard_rule_desc structure
> > defines what individual fields look like.
>
> > Fields (rules) can be of three types:
> > BPF_WILDCARD_RULE_{PREFIX,RANGE,MATCH}.
>
> [..]
>
> > The PREFIX rule means that inside data we have a binary value and a
> > binary (prefix) mask:
>
> >       size              u32
> > <----------------> <----------->
> > ... | rule value | prefix | ...
>
> > Here rule value is a binary value, e.g., 192.168.128.0, and prefix is
> > a u32 variable, of which only the lower 8 bits are used. We allow 8,
> > 16, 32, 64, and 128 bit values for PREFIX rules.
>
> I haven't looked at the code, so pardon the stupid questions. This
> sounds like an optimized LPM-trie?
>
> If not, what's the difference?
First, this map provides a more generic API than LPM. Namely, we allow
not one, but multiple prefixes (say, to specify both a source and a
destination prefix), and also not only prefixes, but ranges. See the
example I've just posted in reply to Daniel, but in short, we can
specify rules like

  (192.68.0.0/16, 10.0.0.0/24, *, 22)

We are also not limited to 4-tuples; we can create any combination of
rules.
Ah, ok, that seems useful. But fundamentally it looks like a
map-in-a-map-in-a-map-in-a-map? So, in theory, at least the IP prefix
lookups can already be implemented with LPM (albeit slower)?
Second, the [tuple merge implementation of this] map uses hash tables to
do lookups, not tries.
I also should have mentioned that I am giving a talk at Plumbers next
Tuesday about this map; I will cover the API and algorithms there, and
provide some numbers.
See https://lpc.events/event/16/contributions/1356/
Awesome, hopefully I'll be there as well, will come say hi :-)
From that link, regarding the following:
Thus, both algorithms are [nearly?] impossible to implement in "pure"
BPF due to lack of functionality and also due to verifier complexity
limits.
There is a natural question here: what's missing? Can we extend the
BPF machinery to lift some of these restrictions? Verifier complexity
limits are pretty different now compared to even two years ago.
> If yes, can this be special-cased/optimized in the existing LPM-trie
> implementation? I think we've tried in the past, mostly gave up, and
> "fixed" it by caching the state in the socket local storage.
>
> Also, fixed 8/16/32/64 prefixes seem like a big limitation? At least if
> I were to store ipv4 from the classless (cidr) world..
This is not for prefixes, but for the field lengths. The basic use case
for this map is to filter packets, so we can create 4- or 5-tuple IPv4
or IPv6 maps. In the first case our field lengths will be
(32, 32, 16, 16), in the second (128, 128, 16, 16). Prefixes can be of
any length from 0 up to the field size.
Understood, thx.
> > The RANGE rule is determined by two binary values: minimum and
> > maximum, treated as unsigned integers of the appropriate size:
>
> >       size               size
> > <----------------> <---------------->
> > ... | min rule value | max rule value | ...
>
> > We only allow 8-, 16-, 32-, and 64-bit values for RANGE rules.
>
> That seems useful. I was thinking about a similar 'rangemap' where
> we can effectively store and look up [a,b] ranges. Might be useful
> in firewalls for storing lists of ports efficiently. So why not
> a separate map instead? Are you able to share a lot of code with
> the general matching map?
Yes, it is all included inside one map. For firewalls, users can create
individual maps with different specifications. Say, one can filter
4-tuples, or 5-tuples, or (src, dst, dst port), or (flow label, dst
port), etc.
> > The MATCH rule is determined by one binary value, and is basically
> > the same as a (X, sizeof(X)*8) PREFIX rule, but can be processed a
> > bit faster:
>
> >       size
> > <---------------->
> > ... | rule value | ...
>
> > To speed up processing, all the rules, including the prefix field,
> > should be stored in host byte order, and all elements in network byte
> > order. 16-byte fields are stored as {lo,hi}: lower eight bytes, then
> > higher eight bytes.
>
> > For elements, only values are stored.
>
> Can these optimizations be auto-magically derived whenever a PREFIX map
> with the right values is created? Why put this decision on the users?
Thanks for pointing this out. I am not really proud of this particular
interface...
For packet filtering, values tend to appear in network byte order, so we
can just assume this. We definitely can convert values to the right
order for rules internally when bpf_map_update_elem is called.
We also can add a map flag to process values in host byte order. This
can be helpful for non-networking usage, when values naturally appear in
host byte order.
Yeah, it seems like the less uapi we export, the better. Maybe some of
that endianness can be deduced via BTF (be32 vs u32) so we can
transparently apply these optimizations?
> > To simplify the definition of key structures, the
> > BPF_WILDCARD_DESC_{1,2,3,4,5} macros should be used. For example, one
> > can define an IPv4 4-tuple key as follows:
>
> > BPF_WILDCARD_DESC_4(
> >         capture4_wcard,
> >         BPF_WILDCARD_RULE_PREFIX, __u32, saddr,
> >         BPF_WILDCARD_RULE_PREFIX, __u32, daddr,
> >         BPF_WILDCARD_RULE_RANGE, __u16, sport,
> >         BPF_WILDCARD_RULE_RANGE, __u16, dport
> > );
>
> > This macro will define the following structure:
>
> > struct capture4_wcard_key {
> >         __u32 type;
> >         __u32 priority;
> >         union {
> >                 struct {
> >                         __u32 saddr;
> >                         __u32 saddr_prefix;
> >                         __u32 daddr;
> >                         __u32 daddr_prefix;
> >                         __u16 sport_min;
> >                         __u16 sport_max;
> >                         __u16 dport_min;
> >                         __u16 dport_max;
> >                 } __packed rule;
> >                 struct {
> >                         __u32 saddr;
> >                         __u32 daddr;
> >                         __u16 sport;
> >                         __u16 dport;
> >                 } __packed;
> >         };
> > } __packed;
>
> > Here the type field should contain either BPF_WILDCARD_KEY_RULE or
> > BPF_WILDCARD_KEY_ELEM so that the kernel can differentiate between
> > rules and elements. The rule structure is used to define (and look
> > up) rules; the unnamed structure can be used to specify elements when
> > matching them against rules.
>
> > In order to simplify the definition of a corresponding struct
> > wildcard_desc, the BPF_WILDCARD_DESC_* macros will create yet another
> > structure:
>
> > struct capture4_wcard_desc {
> >         __uint(n_rules, 4);
> >         struct {
> >                 __uint(type, BPF_WILDCARD_RULE_PREFIX);
> >                 __uint(size, sizeof(__u32));
> >         } saddr;
> >         struct {
> >                 __uint(type, BPF_WILDCARD_RULE_PREFIX);
> >                 __uint(size, sizeof(__u32));
> >         } daddr;
> >         struct {
> >                 __uint(type, BPF_WILDCARD_RULE_RANGE);
> >                 __uint(size, sizeof(__u16));
> >         } sport;
> >         struct {
> >                 __uint(type, BPF_WILDCARD_RULE_RANGE);
> >                 __uint(size, sizeof(__u16));
> >         } dport;
> > };
>
> > This structure can be used in a (BTF) map definition as follows:
>
> > __type(wildcard_desc, struct capture4_wcard_desc);
>
> > Then libbpf will create a corresponding struct wildcard_desc and pass
> > it to the kernel in bpf_attr using the new
> > map_extra_data/map_extra_data_size fields.
>
> [..]
>
> > The map implementation allows users to specify which algorithm to use
> > to store rules and look up packets. Currently, three algorithms are
> > supported:
>
> > * Brute Force (suitable for map sizes of about 32 elements or below)
>
> > * Tuple Merge (a variant of the Tuple Merge algorithm described in
> >   the "TupleMerge: Fast Software Packet Processing for Online Packet
> >   Classification" white paper, see
> >   https://nonsns.github.io/paper/rossi19ton.pdf.
> >   The Tuple Merge algorithm is not protected by any patents.)
>
> > * Static Tuple Merge (a variant of Tuple Merge where a set of lookup
> >   tables is directly provided by a user)
>
> As a user that has no clue how this map works, how do I decide which
> algorithm to use? What I like about the current maps is that they don't
> leak too much of their inner state. These controls seem a bit low
> level?
The idea of letting users select an algorithm appeared because the map
is pretty generic, and some algorithms may work better for some cases.
You're right that there shouldn't be a need to specify the algorithm.
(In fact, users can omit the algorithm flag now, and the default
algorithm will be used. I also made a note to myself to set up the right
default algorithm, as right now this seems to be brute force...)
The reason I ask is: it seems it's just a matter of time until there will be
an algorithm called BPF_WILDCARD_F_ALGORITHM_BPF where you can pass a BPF
program that implements a custom one :-)