Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload

On Wed, May 10, 2023 at 02:55:44PM +0200, Florian Westphal wrote:
> Boris Sukholitko <boris.sukholitko@xxxxxxxxxxxx> wrote:
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > > 
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 ip dscp set cs3
> > >                 ip protocol { tcp, udp, gre } flow add
> > >                 ct state established,related accept
> > >         }
> > > }
> > > ----------
> > > 
> > > This has a clearly defined meaning in all possible combinations.
> > > 
> > > Software:
> > > 1. It defines a bypass for veth0 <-> veth1
> > > 2. the way this specific ruleset is defined, all of tcp/udp/gre will
> > >    attempt to offload
> > 
> > OK.
> > 
> > > 3. once offload has happened, entire inet:forward may be bypassed
> > 
> > By bypassed, do you mean that chain forward ruleset body becomes
> > irrelevant?
> 
> Yes.  It becomes irrelevant because the entire ip forward stack
> becomes irrelevant. Prerouting and postrouting hooks get skipped
> too, would you restore those as well?
> 
> > If the unfortunate answer is yes, then as in the original
> > report, once we are in the software fast path we do not do dscp
> > modification, right?
> 
> Yes. Its expected.  Entire IP stack is bypassed, including MTU
> checks, xfrm policy lookups and so on.
> 
> > Then once we hit the NF_FLOW_TIMEOUT, some of the packets will have dscp
> > modified, because we've gone out of software acceleration. I.e. some of
> > the packets will arrive with and some without dscp. As you can imagine,
> > this could be hard to debug. This is the scenario that the patch series
> > tries to fix.
> > 
> > > 4. User ruleset needs to cope with packets being moved back to
> > >    software: fragmented packets, tcp fin/rst, hw timeouts and so on.
> > 
> > Should we require our user to understand that some lines in their
> > forward table configuration may or may not be executed sporadically?
> 
> Yes.  They have to understand the packet is handled in a *completely
> different* way, the IP stack is bypassed, including ipsec/xfrm policies,
> icmp checks, mtu checks, and so on.
> 
> This is also why the fastpath *must* push some packets back to
> normal plane: so that packet passes through ip_forward() which does
> all this extra work.
> 
> Some packets cannot be added to flowtable either even if user asks
> for it, e.g. when ip options are enabled.
> 
> > Should we give the user at least some kind of warning regarding this
> > during the ruleset load?
> 
> You could try to do this for the 'nft -f' case, but I doubt its worth
> it, see below for an example.
> 
> > To be constructive, isn't it better to rephrase points 3 and 4 as:
> > 
> > 3. once offload has happened, entire inet:forward will be executed with
> > the same semantics but with better performance. Any difference between
> > fast and slow path is considered a bug.
> 
> Err. no.  Because its impossible to do unless you stick a
> nf_hook_slow() call into the "fastpath", otherwise you will diverge
> from what normal path is doing.  And its still not the same,
> feature-wise, because we e.g. cache route info rather than per-packet
> lookup.
> 
> Flowtable software path is the *fallback* for when we don't have offload
> capable hardware, it bypasses entire ip forward plane and packets take
> a different, shorter code path, iff possible.
> 
> > 4. If something in user ruleset (such as dscp rule now) precludes fast
> > path optimization then either error will be given or slow path will be
> > taken with a warning.
> 
> Yes, nftables userspace could be augmented so that 'nft -f' could
> display a warning if there is a rule other than 'conntrack
> established,related accept' or similar as first line.
> 
> But I don't think its worth doing, f.e. someone could be doing
> selective offload based on src/dst networks or similar.
> 
> Updating/expanding flowtable documentation would be welcome of course.
> 
> > > Lets now consider existing netdev:ingress/egress in this same picture:
> > > (Example from Pablo):
> > > ------
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > > 
> > >         chain ingress {
> > >                 type filter hook ingress device veth0 priority filter; policy accept; flags offload;
> > >                 ip dscp set cs3
> > >         }
> > > 
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 meta l4proto { tcp, udp, gre } flow add @f1
> > >                 ct state established,related accept
> > >         }
> > > }
> > > 
> > > Again, this has defined meaning in all combinations:
> > > With HW offload: veth0 will be told to mangle dscp.
> > > This happens in all cases and for every matching packet,
> > > regardless if a flowtable exists or not.
> > > 
> > > Same would happen for 'egress', just that it would happen at xmit time
> > > rather at receive time.  Again, its not relevant if there is active
> > > flowtable or not, or if data path is offloaded to hardware, to software,
> > > handled by fallback or entirely without flowtables being present.
> > > 
> > > Its also clear that this is tied to 'veth0', other devices will
> > > not be involved and not do such mangling.
> > > 
> > 
> > As I've mentioned in my other reply to Pablo, our focus is exclusively
> > on the *software* fast path of the forward chain. In this scenario
> > getting into additional nftables VM path in the ingress or egress seems
> > like pessimization which we'd like to avoid.
> 
> You will have to add a call to nf_hook_slow, or at very least to
> nft_do_chain...
> 
> netdev:in/egress is the existing plumbing that we have that
> allows for this, and I think that this is what should be used
> here.
> 
> > > Now lets look at your proposal:
> > > ----------------
> > > table inet filter {
> > >         flowtable f1 {
> > >                 hook ingress priority filter
> > >                 devices = { veth0, veth1 }
> > >         }
> > > 
> > >         chain forward {
> > >                 type filter hook forward priority filter; policy accept;
> > >                 ip dscp set cs3 offload
> > >                 ip protocol { tcp, udp, gre } flow add
> > >                 ct state established,related accept
> > >         }
> > > }
> > > ----------------
> > > 
> > > This means that software flowtable offload
> > > shall do a 'ip dscp set cs3'.
> > > 
> > > What if the flowtable is offloaded to hardware
> > > entirely, without software fallback?
> > > 
> > > What if the devices listed in the flowtable definition can handle
> > > flow offload, but no payload mangling?
> > > 
> > > Does the 'offload' mean that the rule is only active for
> > > software path?  Only for hardware path? both?
> > > 
> > > How can I tell if its offloaded to hardware for one device
> > > but not for the other?  Or will that be disallowed?
> > > 
> > > What if someone adds another rule after or before 'ip dscp',
> > > but without the 'offload' keyword?  Now ordering becomes an
> > > issue.
> > > 
> > > Users now need to consider different control flows:
> > > 
> > >   jump exceptions
> > >   ip dscp set cs3 offload
> > > 
> > >   chain exceptions {
> > >     ip daddr 1.2.3.4 accept
> > >   }
> > > 
> > > This won't work as expected, because offloaded flows will not
> > > pass through 'forward' chain but somehow a few selected rules
> > > will be run anyway.
> > > 
> > > TL;DR: I think that for HW offload its paramount to make it crystal
> > > clear as to which device is responsible to handle such rules.
> > 
> > Your critique of my offload flag is well deserved and fully correct,
> > thanks! The reason for the dscp statement offload flag was to try to be
> > explicit about the need for payload modification capture.
> > 
> > What we've really wanted to do here is to make the payload capture
> > dependent on the chain flowtable acceleration status. I.e. if the
> > forward chain is supposed to be accelerated, then the payload
> > modification capture should happen. Being lazy, I went down that
> > ugly keyword path. Apologies for that.
> 
> > Yes. But again hardware offload is irrelevant for the problem we have
> > here.
> 
> Unfortunately its not, we cannot make waters murkier and just pretend
> HW offloads don't exist.  Behavior for "flags offload;" presence or
> absence should be identical (in absence of bugs of course).
> 
> > > I don't think mixing software and hardware offload contexts as proposed
> > > is a good idea, both from user frontend syntax, clarity and error reporting
> > > (e.g. if hw rejects offload request) point of view.
> > > 
> > 
> > Frontend syntax and nft userspace should not be affected once we drop
> > the unneeded offload flag on dscp statement.
> 
> So you propose to software-offload all mangle statements and
> prune all other rules...?
> 
> How is that any better than what we do now?
> 
> > > I also believe that allowing payload mangling from *software* offload
> > > path sets a precedent for essentially allowing all other expressions
> > > again which completely negates the flowtable concept.
> > 
> > IMHO, the flowtable concept means transparent acceleration of packet
> > processing between the specified interfaces. If something in the ruleset
> > precludes such acceleration warnings/errors should be given.
> > 
> > Do you agree?
> 
> It would be nice to give a warning but I don't think its feasible.
> Consider something like
> 
>  chain forward {
>   ip dscp set cs3
>   ct state established,related accept
>   ip saddr @should_offload_nets flow add @ft
>  }
> 
> Should this generate a warning?
> 
> Also, because of forward path bypass even if the ruleset is just doing:
> 
> ct state established,related
> ip saddr @can_offload flow add @ft
> 
> Packet flow is not the same as without "flow add" for a large swath of
> packets, as no prerouting or postrouting hooks are executed either.
> 
> > Once the goal of significant performance gains is preserved, the
> > expansion of the universe of accelerated expressions is benign. And why
> > not? We are still being fast.
> 
> I still think that my critique stands, pruning other rules and
> magically considering some while completely disregarding their context
> is not a good idea.
> 

I think I finally understand your reasoning. May I summarise it as
follows:

An nftables forward chain having a 'flow add' clause becomes a request
from the user to skip parts of the Linux network stack. The affected
flows become special and are unaffected by most of the rules of the
"slowpath" forward chain. This is a sharp tool, and the user gets to
keep both pieces if something breaks :)

I agree that this is a well-defined and defensible position.

> Now, theoretically, you could add this:
> 
> chain fast_fwd {
> 	hook flowtable @f1 prio 0
> 	ip dscp set cs3
> }
> 

Yes, I really like that. Here is what such a chain would do (a combined
ruleset sketch follows the list):

1. On the slow path it will behave identically to the forward chain.
2. The only processing done on the fast_fwd fast path is interpretation
   of the struct flow_offload_entry list.
3. Such a fast path is done between the devices defined in flowtable f1.
4. Apart from the interpretation of flow offload entries no other
   processing will be done.
5. (4) means that no Linux IP stack is involved in the forwarding.
6. However (4) allows concatenation of other flow_offload_entry
   producers (e.g. TC, ingress, egress nft chains).
7. flow_offload_entry lists may be connection dependent.
8. Similar to chain forward now, flow_offload_entry lists will be passed
   to devices for hardware acceleration.
9. IOW, flow_offload_entry lists become connection-specific programs.
   Therefore such lists may be compiled to eBPF and accelerated with XDP.
10. flow_action_entry interpreters should be prepared to deal with IP
    fragments and other strangeness that ensues on our networks.
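
For concreteness, here is a rough sketch of how a full ruleset might
look. The 'hook flowtable @f1 prio 0' line is only your hypothetical
syntax from above; the rest is lifted from the earlier examples, with
the dscp rule duplicated so that slow and fast path stay consistent:

table inet filter {
        flowtable f1 {
                hook ingress priority filter
                devices = { veth0, veth1 }
        }

        chain fast_fwd {
                hook flowtable @f1 prio 0
                ip dscp set cs3
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip dscp set cs3
                meta l4proto { tcp, udp, gre } flow add @f1
                ct state established,related accept
        }
}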

> Where this chain is hooked into the flowtable fastpath *ONLY*.

I don't fully understand the ONLY part, but do my points above address
this?

> 
> However, I don't like it either because its incompatible with
> HW offloads and we can be sure that once we allow this people
> will want things like sets and maps too 8-(

I think that, due to point (8) above, the potential for hardware
acceleration is higher. The hardware (e.g. a switch) is free to pass
the packets between flowtable ports without involving the Linux stack
at all. It may do such forwarding because of the promise made in (4) above.

Sets and maps are welcome in chain fast_fwd :) eBPF and XDP already have
them. Once (9) becomes reality we'll be able to support them, somehow :)
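
As a purely illustrative sketch (assuming both the hypothetical
'hook flowtable' chain type and that a set statement could consume a
map lookup there), per-destination dscp marking might look like:

table inet filter {
        map dscp_by_dst {
                typeof ip daddr : ip dscp
                elements = { 192.0.2.1 : cs3, 192.0.2.2 : af21 }
        }

        chain fast_fwd {
                hook flowtable @f1 prio 0
                ip dscp set ip daddr map @dscp_by_dst
        }
}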

> 
> > > I still think that dscp mangling should be done via netdev:ingress/egress
> > > hooks, I don't see why this has to be bolted into flowtable sw offload.
> > > 
> > Because it can be made faster :)
> 
> If you want to make software fastpath fast(er), explore a XDP program
> that can handle the post-offload packet funneling via xdp_frames to get rid of
> sk_buff allocation overhead.
> 
> Program should make upcalls to stack (costly) to get back to software
> and otherwise appear like HW offload-capable device from flowtable point
> of view.  Some parts of the flowtable infrastructure could probably be exposed
> via kfuncs so that certain functionality such as NAT doesn't have to be
> (re)implemented.
> 
> Requires lots of code churn in flowtable support code to get rid
> of sk_buff * arguments whereever possible.
> 
> There is also pending xmit_more support (skb trains/aggregation),
> perhaps it will land in 6.5 cycle.

I've incorporated XDP into my suggestion above. It seems to fit.

What do you think? Is going in the chain fast_fwd direction feasible
and desirable?

Thanks,
Boris.
