Re: Moving from ipset to nftables: Sets not ready for prime time yet?

Stefano Brivio <sbrivio@xxxxxxxxxx> · Fri, 3 Jul 2020 15:38:27 +0200

Hi József,

On Fri, 3 Jul 2020 12:24:03 +0200 (CEST)
Jozsef Kadlecsik <kadlec@xxxxxxxxxxxxx> wrote:

> On Fri, 3 Jul 2020, Stefano Brivio wrote:
> 
> > On Fri,  3 Jul 2020 00:30:10 +0200 (CEST)
> > "Timo Sigurdsson" <public_timo.s@xxxxxxxxxxxxxx> wrote:
> >   
> > > Another issue I stumbled upon was that auto-merge may actually
> > > generate wrong/incomplete intervals if you have multiple 'add
> > > element' statements within an nftables script file. I consider this a
> > > serious issue if you can't be sure whether the addresses or intervals
> > > you add to a set actually end up in the set. I reported this here
> > > [2]. The workaround for it is - again - to add all addresses in a
> > > single statement.  
> > 
> > Practically speaking I think it's a bug, but I can't find a formal,
> > complete definition of automerge, so one can also say it "adds items up
> > to and including the first conflicting one", and there you go, it's
> > working as intended.
> > 
> > In general, when we discussed this "automerge" feature for
> > multi-dimensional sets in nftables (not your case, but I aimed at
> > consistency), I thought it was a mistake to introduce it altogether,
> > because it's hard to define it and whatever definition one comes up
> > with might not match what some users think. Consider this example:
> > 
> > # ipset create s hash:net,net
> > # ipset add s 10.0.1.1/30,192.168.1.1/24
> > # ipset add s 10.0.0.1/24,172.16.0.1
> > # ipset list s
> > [...]
> > Members:
> > 10.0.1.0/30,192.168.1.0/24
> > 10.0.0.0/24,172.16.0.1
> > 
> > good, ipset has no notion of automerge, so it won't try to do anything
> > bad here: the set of address pairs denoted by <10.0.1.1/30,  
> > 192.168.1.1/24> is disjoint from the set of address pairs denoted by  
> > <10.0.0.1/24, 172.16.0.1>. Then:
> > 
> > # ipset add s 10.0.0.1/16,192.168.0.0/16
> > # ipset list s
> > [...]
> > Members:
> > 10.0.1.0/30,192.168.1.0/24
> > 10.0.0.0/16,192.168.0.0/16
> > 10.0.0.0/24,172.16.0.1
> > 
> > and, as expected with ipset, we have entirely overlapping entries added
> > to the set. Is that a problem? Not really, ipset doesn't support maps,
> > so it doesn't matter which entry is actually matched.  
> 
> Actually, the flags, extensions (nomatch, timeout, skbinfo, etc.) in ipset 
> are some kind of mappings and do matter which entry is matched and which 
> flags, extensions are applied to the matching packets.

Oh, I didn't consider that.

> Therefore the matching in the net kind of sets follow a strict ordering: 
> most specific match wins and in the case of multiple dimensions (like 
> net,net above) it goes from left to right to find the best most specific 
> match.

And I didn't know about this either. Well, this looks a bit arbitrary
to me, also because there's no such thing as hash:port,net, so forcing
the left-to-right precedence won't cover all the possible cases anyway.

In nftables, as sets now support an arbitrary number of dimensions, in
an arbitrary order, that would require an explicit evaluation ordering,
which is actually not too hard to implement. I just doubt the usage
would be practical.
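
To make the ordering József describes concrete, here is a minimal
Python sketch of one reading of "most specific match wins, left to
right" for a two-dimensional net,net set. It is an illustration only,
not ipset's implementation: best_match() and net() are hypothetical
helpers, and comparing prefix lengths lexicographically is an
assumption about what "left to right" means.

```python
import ipaddress


def net(text):
    # Accept host bits in the prefix, as the ipset examples above do.
    return ipaddress.ip_network(text, strict=False)


def best_match(entries, src, dst):
    """Return the matching (net, net) entry with the longest prefix in
    the first dimension, using the second dimension as a tie-breaker."""
    src = ipaddress.ip_address(src)
    dst = ipaddress.ip_address(dst)
    candidates = [(a, b) for a, b in entries if src in a and dst in b]
    if not candidates:
        return None
    # Lexicographic comparison: first dimension decides, then the second.
    return max(candidates, key=lambda e: (e[0].prefixlen, e[1].prefixlen))


# The three overlapping entries from the ipset example above.
entries = [
    (net("10.0.1.1/30"), net("192.168.1.1/24")),
    (net("10.0.0.1/16"), net("192.168.0.0/16")),
    (net("10.0.0.1/24"), net("172.16.0.1")),
]

match = best_match(entries, "10.0.1.1", "192.168.1.5")
```

With these entries, a packet pair covered by both the /30 and the /16
entry resolves to the /30 one, regardless of insertion order. An
nftables equivalent would need the ordering of dimensions to be stated
explicitly, since fields can appear in any order.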

> > # nft add table t
> > # nft add set t s '{ type ipv4_addr . ipv4_addr; flags interval ; }'
> > # nft add element t s '{ 10.0.1.1/30 . 192.168.1.1/24 }'
> > # nft add element t s '{ 10.0.0.1/24 . 172.16.0.1 }'
> > # nft add element t s '{ 10.0.0.1/16 . 192.168.0.0/16 }'
> > # nft list ruleset
> > table ip t {
> > 	set s {
> > 		type ipv4_addr . ipv4_addr
> > 		flags interval
> > 		elements = { 10.0.1.0/30 . 192.168.1.0/24,
> > 			     10.0.0.0/24 . 172.16.0.1,
> > 			     10.0.0.0/16 . 192.168.0.0/16 }
> > 	}
> > }
> > 
> > also fine: the least generic entry is added first, so it matches first.
> > Let's try to reorder the insertions:
> > 
> > # nft add element t s '{ 10.0.0.1/16 . 192.168.0.0/16 }'
> > # nft add element t s '{ 10.0.0.1/24 . 172.16.0.1 }'
> > # nft add element t s '{ 10.0.1.1/30 . 192.168.1.1/24 }'
> > Error: Could not process rule: File exists
> > add element t s { 10.0.1.1/30 . 192.168.1.1/24 }
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > ...because that entry would never match anything: it's inserted after a
> > more generic one that already covers it completely, and we'd like to
> > tell the user that it doesn't make sense.  
> 
> I think sets should not store information about which order the entries 
> were added. That should totally be indifferent. The input of the sets may 
> come from countless sources and if the order of adding the entries matters 
> then a preordering is required, which is sometimes non-trivial.

As it comes for free, I think it's nice to leave this possibility open
for simple combinations. It doesn't introduce any ambiguity. It's not
a usage I would recommend anyway, but I don't see the harm.
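
The "File exists" case in the reordered example quoted above can be
modelled roughly as follows. This is a sketch of the behaviour shown
in that example, not the kernel code: try_add() and net() are
hypothetical helpers, and the rule "reject a new entry if an existing
one covers it in every dimension" is an assumption inferred from the
example.

```python
import ipaddress


def net(text):
    # Accept host bits in the prefix, as the nft examples above do.
    return ipaddress.ip_network(text, strict=False)


def try_add(elements, new):
    """Append 'new' unless an existing entry already covers it in every
    dimension. Return True if it was added, False if it was rejected."""
    for old in elements:
        if all(n.subnet_of(o) for n, o in zip(new, old)):
            return False  # would never match: fully shadowed
    elements.append(new)
    return True


e30 = (net("10.0.1.1/30"), net("192.168.1.1/24"))
e24 = (net("10.0.0.1/24"), net("172.16.0.1"))
e16 = (net("10.0.0.1/16"), net("192.168.0.0/16"))

# Least generic first: all three insertions are accepted.
specific_first = []
results_specific_first = [try_add(specific_first, e) for e in (e30, e24, e16)]

# Most generic first: the /30 entry is shadowed and rejected.
generic_first = []
results_generic_first = [try_add(generic_first, e) for e in (e16, e24, e30)]
```

In this model the outcome depends on insertion order only for fully
covered entries, which is the case the error message is meant to flag.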

> > Now, this is pretty much the only advantage of not allowing overlaps:
> > telling the user that some insertion doesn't make sense, and thus it
> > was probably not what the user wanted to do.  
> 
> This makes also impossible to make exceptions in the sets in nftables - 
> with the "nomatch" flag in ipset one can easily create exceptions in 
> intentionally overlapping entries (in whatever deep nestings) in a single 
> set. In practice it comes quite handy to say
> 
> ipset create access_to_servers hash:ip,port,net
> ipset add access_to_servers your_ssh_server,22,x.y.z.0/24
> ipset add access_to_servers your_ssh_server,22,x.y.z.32/27 nomatch
> ...
> 
> and exclude access to some parts of a given subnet.
> 
> However, the internals of the sets in nftables are totally different from 
> ipset, so I'm pretty sure it's absolutely not trivial (and sometimes 
> impossible) to provide exactly the same behaviour.

It's actually kind of trivial for nft_set_pipapo. For nft_set_hash it
doesn't apply (it doesn't implement intervals), and I'm not sure about
nft_set_rbtree right now.

However, does this really provide any value compared to having a
separate set for exceptions matched earlier in a chain?

If it really does, I think it could and should be done in userspace by
splitting the intervals. The kernel back-ends shouldn't be overloaded
with complexity that doesn't *need* to live there, and no matter what,
this is going to have a performance impact on the lookup (it should be
doable to avoid an explicit branch for this, but we can't avoid
fetching more bits per element).
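
As a rough idea of what splitting the intervals in userspace could
look like, here is a minimal Python sketch using the standard
ipaddress module. split_with_exception() is a hypothetical helper,
and 192.0.2.0/24 with 192.0.2.32/27 are documentation-range stand-ins
for the elided x.y.z.0/24 and x.y.z.32/27 in the quoted ipset example.

```python
import ipaddress


def split_with_exception(allowed, excluded):
    """Return the networks that cover 'allowed' minus 'excluded'.

    Raises ValueError if 'excluded' is not contained in 'allowed'."""
    allowed = ipaddress.ip_network(allowed)
    excluded = ipaddress.ip_network(excluded)
    return sorted(allowed.address_exclude(excluded))


remaining = split_with_exception("192.0.2.0/24", "192.0.2.32/27")
```

The resulting networks could then be added to the set as ordinary,
non-overlapping elements, so the lookup path would need no notion of
"nomatch" at all.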

Ideally, I would even like to drop the need for timeout and validity
checks as part of the lookup, because they are quite heavy (fetching
the 'extension' pointer, branches, etc.). It involves some internal API
refactoring and is actually on my motionless to-do list, but too far
from the surface to have any practical value.

-- 
Stefano