Re: [PATCH nf-next 00/19] netfilter: nftables: dscp modification offload

Boris Sukholitko <boris.sukholitko@xxxxxxxxxxxx> · Wed, 10 May 2023 10:49:25 +0300

On Tue, May 09, 2023 at 11:48:27AM +0200, Florian Westphal wrote:
> Boris Sukholitko <boris.sukholitko@xxxxxxxxxxxx> wrote:
> > On Sun, May 07, 2023 at 07:37:58PM +0200, Florian Westphal wrote:
> > > Boris Sukholitko <boris.sukholitko@xxxxxxxxxxxx> wrote:
> > > > On Wed, May 3, 2023 at 9:46 PM Florian Westphal <fw@xxxxxxxxx> wrote:
> > > > >
> > > > > Boris Sukholitko <boris.sukholitko@xxxxxxxxxxxx> wrote:
> > > > [... snip to non working offload ...]
> > > > 
> > > > > > table inet filter {
> > > > > >         flowtable f1 {
> > > > > >                 hook ingress priority filter
> > > > > >                 devices = { veth0, veth1 }
> > > > > >         }
> > > > > >
> > > > > >         chain forward {
> > > > > >                 type filter hook forward priority filter; policy accept;
> > > > > >                 ip dscp set cs3 offload
> > > > > >                 ip protocol { tcp, udp, gre } flow add @f1
> > > > > >                 ct state established,related accept
> > > > > >         }
> > > > > > }
> > > > 
> > > > [...]
> > > > 
> > > > >
> > > > > I wish you would have reported this before you started to work on
> > > > > this, because this is not a bug, this is expected behaviour.
> > > > >
> > > > > Once you offload, the ruleset is bypassed, this is by design.
> > > > 
> > > > From the rules UI perspective it seems possible to accelerate
> > > > forward chain handling with the statements such as dscp modification there.
> > > > 
> > > > Isn't it better to modify the packets according to the bypassed
> > > > ruleset thus making the behaviour more consistent?
> > > 
> > > The behaviour is consistent.  Once flow is offloaded, ruleset is
> > > bypassed.  Its easy to not offload those flows that need the ruleset.
> > > 
> > > > > Lets not make the software offload more complex as it already is.
> > > > 
> > > > Could you please tell which parts of software offload are too complex?
> > > > It's not too bad from what I've seen :)
> > > > 
> > > > This patch series adds 56 lines of code in the new nf_conntrack.ext.c
> > > > file. 20 of them (nf_flow_offload_apply_payload) are used in
> > > > the software fast path. Is it too high of a price?
> > > 
> > > 56 lines of code *now*.
> > > 
> > > Next someone wants to call into sets/maps for named counters that
> > > they need.  Then someone wants limit or quota to work.  Then they want fib
> > > for RPF.  Then xfrm policy matching to augment acccounting.
> > > This will go on until we get to the point where removing "fast" path
> > > turns into a performance optimization.
> > 
> > OK. May I assume that you are concerned with the eventual performance impact
> > on the software fast path (i.e. nf_flow_offload_ip_hook)?
> 
> Yes, but I also dislike the concept, see below.
> 
> > Obviously the performance of the fast path is very important to our
> > customers. Otherwise they would not be requiring dscp fast path
> > modification. :)
> > 
> > One of the things we've thought about regarding the fast path
> > performance is rewriting nf_flow_offload_ip_hook to work with
> > nf_flowtable->flow_block instead of flow_offload_tuple.
> 
> Sorry, I should have expanded on my reservations towards this concept.
> 
> Let me explain.
> Lets consider your original example first:
> 
> ----------
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3
>                 ip protocol { tcp, udp, gre } flow add
>                 ct state established,related accept
>         }
> }
> ----------
> 
> This has a clearly defined meaning in all possible combinations.
> 
> Software:
> 1. It defines a bypass for veth0 <-> veth1
> 2. the way this specific ruleset is defined, all of tcp/udp/gre will
>    attempt to offload

OK.

> 3. once offload has happened, entire inet:forward may be bypassed

By bypassed, do you mean that chain forward ruleset body becomes
irrelevant? If the unfortunate answer is yes, than as in the original
report, once we are in the software fast path we do not do dscp
modification, right?

Then once we hit the NF_FLOW_TIMEOUT, some of the packets will have dscp
modified, because we've went out of software acceleration. I.e. some of
the packets will arrive with and some without dscp. As you can imagine,
this could be hard to debug. This is the scenario that the patch series
tries to fix.

> 4. User ruleset needs to cope with packets being moved back to
>    software: fragmented packets, tcp fin/rst, hw timeouts and so on.

Should we require our user to understand that some lines in their
forward table configuration may or may not be executed sporadically?
Should we give the user at least some kind of warning regarding this
during the ruleset load?

To be constructive, isn't it better to rephrase points 3 and 4 as:

3. once offload has happened, entire inet:forward will be executed with
the same semantics but with better performance. Any difference between
fast and slow path is considered a bug.

4. If something in user ruleset (such as dscp rule now) precludes fast
path optimization then either error will be given or slow path will be
taken with a warning.

> 5. User can control via 'offload' keyword if HW offload should be
>    attempted or not
> 
> Hardware:
> even 'nf_flow_offload_ip_hook' may be bypassed.  But nothing changes
> compared to 'no hw offload' case from a conceptual point of view.
> 

Agreed.

> Lets now consider existing netdev:ingress/egress in this same picture:
> (Example from Pablo):
> ------
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain ingress {
>                 type filter hook ingress device veth0 priority filter; policy accept; flags offload;
>                 ip dscp set cs3
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 meta l4proto { tcp, udp, gre } flow add @f1
>                 ct state established,related accept
>         }
> }
> 
> Again, this has defined meaning in all combinations:
> With HW offload: veth0 will be told to mangle dscp.
> This happens in all cases and for every matching packet,
> regardless if a flowtable exists or not.
> 
> Same would happen for 'egress', just that it would happen at xmit time
> rather at receive time.  Again, its not relevant if there is active
> flowtable or not, or if data path is offloaded to hardware, to software,
> handled by fallback or entirely without flowtables being present.
> 
> Its also clear that this is tied to 'veth0', other devices will
> not be involved and not do such mangling.
> 

As I've mentioned in my other reply to Pablo, our focus is exclusively
on the *software* fast path of the forward chain. In this scenario
getting into additional nftables VM path in the ingress or egress seems
like pessimization which we'd like to avoid.

> Now lets look at your proposal:
> ----------------
> table inet filter {
>         flowtable f1 {
>                 hook ingress priority filter
>                 devices = { veth0, veth1 }
>         }
> 
>         chain forward {
>                 type filter hook forward priority filter; policy accept;
>                 ip dscp set cs3 offload
>                 ip protocol { tcp, udp, gre } flow add
>                 ct state established,related accept
>         }
> }
> ----------------
> 
> This means that software flowtable offload
> shall do a 'ip dscp set cs3'.
> 
> What if the flowtable is offloaded to hardware
> entirely, without software fallback?
> 
> What if the devices listed in the flowtable definition can handle
> flow offload, but no payload mangling?
> 
> Does the 'offload' mean that the rule is only active for
> software path?  Only for hardware path? both?
> 
> How can I tell if its offloaded to hardware for one device
> but not for the other?  Or will that be disallowed?
> 
> What if someone adds another rule after or before 'ip dscp',
> but without the 'offload' keyword?  Now ordering becomes an
> issue.
> 
> Users now need to consider different control flows:
> 
>   jump exceptions
>   ip dscp set cs3 offload
> 
>   chain exceptions {
>     ip daddr 1.2.3.4 accept
>   }
> 
> This won't work as expected, because offloaded flows will not
> pass through 'forward' chain but somehow a few selected rules
> will be run anyway.
> 
> TL;DR: I think that for HW offload its paramount to make it crystal
> clear as to which device is responsible to handle such rules.

Your critique of my offload flag is well deserved and fully correct,
thanks! The reason for the dscp statement offload flag was to try to be
explicit about the need for payload modification capture.

What we've really wanted to do here is to make the payload capture
dependent on the chain flowtable acceleration status. I.e. if the
forward chain is supposed to be accelerated, than the payload
modification capture should happen. Being lazy, I've went through that
ugly keyword path. Apologies for that.

> 
> The existing netdev:ingress/egress hooks provide the needed
> chain/rules/expression:device mapping.  User can easily
> tell if HW is responsible or SW by looking for 'offload' flag
> presence.
> 

Yes. But again hardware offload is irrelevant for the problem we have
here.

> I don't think mixing software and hardware offload contexts as proposed
> is a good idea, both from user frontend syntax, clarity and error reporting
> (e.g. if hw rejects offload request) point of view.
> 

Frontend syntax and nft userspace should not be affected once we drop
the unneeded offload flag on dscp statement.

> I also believe that allowing payload mangling from *software* offload
> path sets a precedence for essentially allowing all other expressions
> again which completely negates the flowtable concept.

IMHO, the flowtable concept means transparent acceleration of packet
processing between the specified interfaces. If something in the ruleset
precludes such acceleration warnings/errors should be given.

Do you agree?

Once the goal of significant performance gains is preserved, the
expansion of the universe of accelerated expressions is benign. And why
not? We are still being fast.

My suggestion regarding doing the software fast path through the
flow_block tries to decouple the performance of the fast path from the
size of such universe. The basic idea is that you don't pay for what you
don't use.

> 
> I still think that dscp mangling should be done via netdev:ingress/egress
> hooks, I don't see why this has to be bolted into flowtable sw offload.
> 

Because it can be made faster :)

Thanks,
Boris.
Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature