On Tue, 3 Dec 2019 12:02:54 +0100
Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:

> Hi Stefano,
>
> On Mon, Dec 02, 2019 at 05:19:52PM +0100, Stefano Brivio wrote:
> [...]
> > On Mon, 2 Dec 2019 14:14:05 +0100
> > Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> wrote:
> [...]
> > > This patchset extends the netlink API to allow expressing an
> > > interval with a single element.
> > >
> > > This simplifies this interface since userspace does not need to
> > > send two independent elements anymore, one of them including the
> > > NFT_SET_ELEM_INTERVAL_END flag.
> > >
> > > The idea is to use the _DESC to specify that userspace speaks this
> > > new API representation to the kernel. In your case, the new
> > > description attribute that tells that this set contains interval +
> > > concatenation implicitly tells the kernel that userspace supports
> > > this new API.
> >
> > Thanks! I just had a quick look, I think the new set implementation
> > would indeed look more elegant this way. As to design choices, I'm
> > afraid I'm not familiar enough with the big picture to comment on
> > the general idea, but my uninformed opinion agrees with this
> > approach. :)
> >
> > For what it's worth, I'd review this in deeper detail next.
>
> Thanks.
>
> > > If you're fine with this, I can scratch a bit of time to finish
> > > the libnftnl part. The nft code will need a small update too. You
> > > will not need to use the nft_set_pipapo object as scratchpad area
> > > anymore.
> >
> > On my side, I'm almost done with nft/libnftnl/kernel changes for the
> > NFT_SET_DESC_CONCAT thing. How should we proceed? Do you want me to
> > share those patches so that you can add this bit on top, or should
> > this come first, or in a separate series?
>
> My suggestion is that you take them and place them at the beginning
> of your batch, since it will be the first client for this new netlink
> attribute; you will have to adapt pipapo to use the new key_end value
> too.

Okay, then I'll adjust the NFT_SET_DESC_CONCAT changes on top of your
changes.

> > I could also just share the new nft/libnftnl patches (I should have
> > them ready between today and tomorrow), and proceed adapting the
> > kernel part according to your changes.
>
> I still have to send you the libnftnl part for this.

If that's the order of changes you suggest, yes. Anyway, I guess I can
already proceed adjusting the kernel part meanwhile.

> > Related question: to avoid copying data around, I'm now dynamically
> > allocating a struct nft_set_desc in nf_tables_newset() with a
> > reference from struct nft_set: desc->dlen, desc->klen, desc->size
> > would all live there, together with the "subkey" stuff.
> >
> > Is it a bad idea? I can undo it easily, I just don't know if there's
> > a specific reason why those fields are repeated in struct nft_set.
>
> Not sure I understand, probably some code sketch? From your words it
> does not look like a major issue though, but let me know.

Right now, nf_tables_newset() in nf_tables_api.c has:

	struct nft_set_desc desc;

	[...]

	memset(&desc, 0, sizeof(desc));

	[...]

	desc.klen = ntohl(nla_get_be32(nla[NFTA_SET_KEY_LEN]));
	[and so on]

then values from desc are copied one by one into struct nft_set *set,
later in the same function:

	set->klen = desc.klen;
	set->dlen = desc.dlen;
	set->size = desc.size;

and I don't see the point (well, we avoid one further dereference in
e.g. __nft_rbtree_lookup(), but is that worth it?).
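To make that "one further dereference" concrete, the difference in a
lookup path would be roughly the following. This is a loose paraphrase
of the key comparison in __nft_rbtree_lookup(), not the actual code,
and set->desc is the pointer I'm proposing below, it doesn't exist
today:

	/* Current layout: klen is embedded in struct nft_set itself,
	 * so the backend reads it with a single dereference:
	 */
	d = memcmp(nft_set_ext_key(&rbe->ext), key, set->klen);

	/* With a referenced descriptor, the same read would cost one
	 * extra pointer chase per comparison on the fast path:
	 */
	d = memcmp(nft_set_ext_key(&rbe->ext), key, set->desc->klen);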
As I'm adding two fields (field_len[] and field_count) to struct
nft_set_desc, I would instead allocate it dynamically:

	struct nft_set_desc *desc;

	[...]

	desc = kzalloc(sizeof(*desc), GFP_KERNEL);

fill it:

	desc->klen = ntohl(nla_get_be32(nla[NFTA_SET_KEY_LEN]));
	[and so on]

and then store a reference to it in struct nft_set *set:

	set->desc = desc;

instead of copying fields one by one, and having their declaration
duplicated in struct nft_set and struct nft_set_desc.

I can also stick to the current way, that is, add those fields to both
struct nft_set and struct nft_set_desc, and copy them. It's mostly a
matter of taste, unless there's a specific reason why we shouldn't
point to a struct nft_set_desc from struct nft_set.

> BTW, there is also one more pending issue: I can see there is a clone
> point in nft_set_pipapo, you mentioned some problems making things
> fit into the transaction infrastructure. Could you describe how you
> integrate with it?

Currently, there are no problems, and the description of how the
relevant operations integrate with the API is in the kerneldoc
comments for nft_pipapo_activate(), _deactivate(), _flush() and
_remove(). While the operations are, per se, conceptually not atomic,
atomicity is then guaranteed by the genmask mechanism.

The problems would start once we want to add some specific
optimisations that rely on the fact that there are no (entirely)
overlapping entries in current, live matching data.

If we could rely on that, and we opted to opportunistically coalesce
per-field classifier regions, we could reduce the complexity of some
steps. Let's say we have a set matching, separately, two entries:

- 192.168.1.1:1024 -> x
- 192.168.1.1:2048 -> y

then in the classifier region for the IPv4 address field we could
naturally have a single rule that, if matched, would lead to
evaluation of the port bits against both 1024 and 2048, which would
then map to two different entries. Let's name:

- "192.168.1.1" rule #0 for the IPv4 address
- "1024" rule #0 for the port (mapping to x)
- "2048" rule #1 for the port (mapping to y)

Then rule #0 for the IPv4 address activates processing of rule #0 and
rule #1 for the port bits, and at each step we select up to one single
rule, which results in up to one single match.

However, we can have this situation:

- 192.168.1.1:1024 -> x (active element reference)
- 192.168.1.1:2048 -> y (active element reference)
- 192.168.1.1:1024 -> z (inactive element reference)

so we need:

- "1024" rule #0 for the port (mapping to x)
- "2048" rule #1 for the port (mapping to y)
- "1024" rule #2 for the port (mapping to z)

and once we have evaluated the port field, we might have n > 1 rules
selected, so we might need to check multiple elements before returning
a match, and this introduces complexity in the fast path. We also need
the nft_set_elem_active() check, which is quite expensive.

I wouldn't introduce this kind of optimisation right now, mostly
because of the complexity it adds to listing and deletion, and I'd
prefer the basic implementation to be reasonably mature before
proceeding with this.

In general, I find the current API a bit unnatural. As I mentioned, I
would have expected to implement a single insert operation, a single
delete operation and a commit routine, instead of five operations that
can't be clearly defined in terms of a typical transaction/commit
model.

I think that a clear distinction between insert/delete and commit
would also more naturally fit the publish/subscribe RCU model.
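Just to picture the shape I have in mind, a back-end interface along
those lines could look, very roughly, like the sketch below. This is
purely illustrative: the struct name, the callbacks and their
signatures are all made up here, they are not the existing
struct nft_set_ops.

	/*
	 * Hypothetical, simplified transactional interface: insert()
	 * and delete() only queue pending changes against a private
	 * working copy, commit() publishes them all at once or fails
	 * as a whole, abort() throws the pending changes away.
	 */
	struct nft_set_ops_sketch {
		int	(*insert)(const struct nft_set *set,
				  const struct nft_set_elem *elem);
		int	(*delete)(const struct nft_set *set,
				  const struct nft_set_elem *elem);
		int	(*commit)(const struct nft_set *set);
		void	(*abort)(const struct nft_set *set);
	};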
If we had a single commit callback, there couldn't be entirely
overlapping entries in live matching data, because if any such entry
were present in the commit phase, the commit would fail. On top of
that, the flush() operation would actually correspond to its name.

> Probably there is a chance to extend the front-end
> API too to make it easier for pipapo.

I think that any change going in this direction, whether desirable or
not, is going to be quite invasive. In particular, I would first try
to figure out pros and cons for the existing set implementations.

Some set implementations would probably benefit from having a
completely separate copy of matching data where pending operations are
performed. In general, we would be able to skip the
nft_set_elem_active() check in lookup functions, which looks expensive
in case the NFT_SET_MAP flag is not set, because just for that single
check we need to read data quite far away.

Inserting elements directly into live data might also result in some
amount of false sharing, but I wouldn't comment further on that
without some actual test results.

In conclusion, I'd rather defer this kind of API change (again, if it
even makes sense) to a later time. I would start with some simplified
implementation to check how the existing set types behave with it, and
I don't see any "quick" way to simplify pipapo's operations any
further right now.

-- 
Stefano