Re: [PATCH nf] netfilter: flowtable: infer TCP state and timeout before flow teardown

On Wed, Mar 20, 2024 at 01:37:58PM +0100, Pablo Neira Ayuso wrote:
> On Wed, Mar 20, 2024 at 12:15:51PM +0100, Sven Auhagen wrote:
> > On Wed, Mar 20, 2024 at 11:47:50AM +0100, Pablo Neira Ayuso wrote:
> > > On Wed, Mar 20, 2024 at 11:29:05AM +0100, Sven Auhagen wrote:
> > > > On Wed, Mar 20, 2024 at 10:51:39AM +0100, Pablo Neira Ayuso wrote:
> > > > > On Wed, Mar 20, 2024 at 10:31:00AM +0100, Sven Auhagen wrote:
> > > > > > On Wed, Mar 20, 2024 at 10:27:30AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > On Wed, Mar 20, 2024 at 10:20:29AM +0100, Sven Auhagen wrote:
> > > > > > > > On Wed, Mar 20, 2024 at 10:07:24AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > On Wed, Mar 20, 2024 at 09:49:49AM +0100, Sven Auhagen wrote:
> > > > > > > > > > On Wed, Mar 20, 2024 at 09:45:16AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > > Hi Sven,
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, Mar 20, 2024 at 09:39:16AM +0100, Sven Auhagen wrote:
> > > > > > > > > > > > On Mon, Mar 18, 2024 at 10:39:15AM +0100, Pablo Neira Ayuso wrote:
> > > > > > > > > > > [...]
> > > > > > > > > > > > > diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > index a0571339239c..481fe3d96bbc 100644
> > > > > > > > > > > > > --- a/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > +++ b/net/netfilter/nf_flow_table_core.c
> > > > > > > > > > > > > @@ -165,10 +165,22 @@ void flow_offload_route_init(struct flow_offload *flow,
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >  EXPORT_SYMBOL_GPL(flow_offload_route_init);
> > > > > > > > > > > > >  
> > > > > > > > > > > > > -static void flow_offload_fixup_tcp(struct ip_ct_tcp *tcp)
> > > > > > > > > > > > > +static s32 flow_offload_fixup_tcp(struct net *net, struct nf_conn *ct,
> > > > > > > > > > > > > +				  enum tcp_conntrack tcp_state)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > > -	tcp->seen[0].td_maxwin = 0;
> > > > > > > > > > > > > -	tcp->seen[1].td_maxwin = 0;
> > > > > > > > > > > > > +	struct nf_tcp_net *tn = nf_tcp_pernet(net);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	ct->proto.tcp.state = tcp_state;
> > > > > > > > > > > > > +	ct->proto.tcp.seen[0].td_maxwin = 0;
> > > > > > > > > > > > > +	ct->proto.tcp.seen[1].td_maxwin = 0;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	/* Similar to mid-connection pickup with loose=1.
> > > > > > > > > > > > > +	 * Avoid large ESTABLISHED timeout.
> > > > > > > > > > > > > +	 */
> > > > > > > > > > > > > +	if (tcp_state == TCP_CONNTRACK_ESTABLISHED)
> > > > > > > > > > > > > +		return tn->timeouts[TCP_CONNTRACK_UNACK];
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Pablo,
> > > > > > > > > > > > 
> > > > > > > > > > > > I tested the patch, but the part that sets the timeout to UNACK is not
> > > > > > > > > > > > very practical.
> > > > > > > > > > > > For example, my long-running SSH connections now get killed off by the
> > > > > > > > > > > > firewall regularly while being ESTABLISHED:
> > > > > > > > > > > > 
> > > > > > > > > > > > [NEW] tcp      6 120 SYN_SENT src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 [UNREPLIED] src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > > [UPDATE] tcp      6 60 SYN_RECV src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 mark=16777216
> > > > > > > > > > > > [UPDATE] tcp      6 86400 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=55582 dport=22 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=55582 [OFFLOAD] mark=16777216
> > > > > > > > > > > > 
> > > > > > > > > > > > [DESTROY] tcp      6 ESTABLISHED src=192.168.6.55 dst=192.168.10.22 sport=54941 dport=22 packets=133 bytes=13033 src=192.168.10.22 dst=192.168.6.55 sport=22 dport=54941 packets=95 bytes=15004 [ASSURED] mark=16777216 delta-time=1036
> > > > > > > > > > > > 
> > > > > > > > > > > > I would remove the if case here.
> > > > > > > > > > > 
> > > > > > > > > > > OK, I remove it and post a v2. Thanks!
> > > > > > > > > > 
> > > > > > > > > > Thanks and also the hardcoded TCP_CONNTRACK_ESTABLISHED in flow_offload_fixup_ct
> > > > > > > > > > should be reverted to the real tcp state ct->proto.tcp.state.
> > > > > > > > > 
> > > > > > > > > ct->proto.tcp.state contains the state that was observed before
> > > > > > > > > handing over this flow to the flowtable; in most cases, this should
> > > > > > > > > be TCP_CONNTRACK_ESTABLISHED.
> > > > > > > > > 
> > > > > > > > > > This way we always set the current TCP timeout.
> > > > > > > > > 
> > > > > > > > > I can keep it to ct->proto.tcp.state but I wonder if it is better to
> > > > > > > > > use a well known state such as TCP_CONNTRACK_ESTABLISHED to pick up from.
> > > > > > > > 
> > > > > > > > In case of a race condition, or if something is off like my TCP_FIN
> > > > > > > > that is being offloaded again, hard-coding ESTABLISHED will turn
> > > > > > > > e.g. a FIN or CLOSE into a very long-lived state.
> > > > > > > > It is not guaranteed that we are still in ESTABLISHED when this code
> > > > > > > > runs. Also, for example, we could have seen both FINs already before
> > > > > > > > the flowtable gc runs.
> > > > > > > 
> > > > > > > OK, I just posted a v2 leaving things as is. I agree it is better to
> > > > > > > only address the issue you are observing at this time; it is possible
> > > > > > > to revisit later.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > 
> > > > > > Thanks, I will give it another try.
> > > > > > I think for it to be foolproof we also need to migrate the TCP state
> > > > > > to FIN or CLOSE in flow_offload_teardown_tcp.
> > > > > 
> > > > > My patch already does it:
> > > > > 
> > > > > +void flow_offload_teardown_tcp(struct flow_offload *flow, bool fin)
> > > > > +{
> > > > > +       enum tcp_conntrack tcp_state;
> > > > > +
> > > > > +       if (fin)
> > > > > +               tcp_state = TCP_CONNTRACK_FIN_WAIT;
> > > > > +       else /* rst */
> > > > > +               tcp_state = TCP_CONNTRACK_CLOSE;
> > > > > +
> > > > > +       flow_offload_fixup_tcp(nf_ct_net(flow->ct), flow->ct, tcp_state);
> > > > > 
> > > > > flow_offload_fixup_tcp() updates the TCP state to FIN / CLOSE state.
> > > > > 
> > > > > > The way things are right now, we are open to a race condition where the
> > > > > > reverse side of the connection re-offloads it while a FIN or RST has not
> > > > > > yet been processed by the netfilter code running after the flowtable code.
> > > > > > The connection is still in TCP ESTABLISHED during that window, and another
> > > > > > packet can just push it back to the flowtable while the FIN or RST is not
> > > > > > processed yet.
> > > > > 
> > > > > I don't see how:
> > > > > 
> > > > > static void nft_flow_offload_eval(const struct nft_expr *expr,
> > > > >                                   ...
> > > > > 
> > > > >         switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
> > > > >         case IPPROTO_TCP:
> > > > >                 tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
> > > > >                                           sizeof(_tcph), &_tcph);
> > > > >                 if (unlikely(!tcph || tcph->fin || tcph->rst ||
> > > > >                              !nf_conntrack_tcp_established(ct)))
> > > > >                         goto out;
> > > > > 
> > > > > this would now be either in FIN/CLOSE state.
> > > > > 
> > > > > A FIN or RST packet does not trigger re-offload, and an ACK packet
> > > > > would bail out on !nf_conntrack_tcp_established(ct).
> > > > > 
> > > > > What path could trigger re-offload after my latest patch?
> > > > 
> > > > From looking through the nf conntrack tcp code you need to spin_lock
> > > > the TCP state change to avoid a race with another packet.
> > > 
> > > The flowtable owns the flow; packets belonging to the flow cannot update
> > > the TCP state while the flow is offloaded to the flowtable.
> > > 
> > > Once the _TEARDOWN flag is set, packets go back to the classic conntrack
> > > path.
> > 
> > Hmm alright, something is going wrong somewhere and it looks a lot like
> > a race condition :)
> 
> This check is racy, another packet could alter the ct state right
> after it evaluates true.
> 
>          switch (ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum) {
>          case IPPROTO_TCP:
>                  tcph = skb_header_pointer(pkt->skb, nft_thoff(pkt),
>                                            sizeof(_tcph), &_tcph);
>                  if (unlikely(!tcph || tcph->fin || tcph->rst ||
>                               !nf_conntrack_tcp_established(ct))) <-------
>                          goto out;
> 
> Sequence would be:
> 
> 1) flow expires (after 30 seconds), goes back to conntrack in established state
> 2) packet re-offloads the flow and nf_conntrack_tcp_established(ct) evaluates true.
> 3) FIN packet races to update conntrack getting to FIN_WAIT while re-offloading
>    the flow.
> 
> Then you see FIN_WAIT and offload together; it could happen with an
> expired flow that goes back to conntrack.
> 
> But I am not sure yet if this is the case you're observing there.
> 
> > I mean just in theory it is not guaranteed that both directions send all
> > packets through the flowtable just because it is offloaded.
> > A variety of error checks might send a packet back to the slow path.
> 
> There is the mtu check that is lacking the teardown, but that should
> only affect UDP traffic; a patch from Felix cached the mtu in the flow
> entry. That is probably also convenient to have, but it looks like a
> different issue; I will also post a patch for it.

Hi Pablo,

after some testing, the problem only happens very rarely now.
I suspect it happens only on connections that are at some point
one-way only, or in some other way no longer in a correct state.
Nevertheless, your latest patches are very good: they reduce the problem
of FIN_WAIT being offloaded to an absolute minimum, and the timeout
is correct now.
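
Just to double check that I read the new behaviour correctly, here is a
tiny userspace sketch of how I understand the state/timeout selection on
teardown now (the array name and the values are only stand-ins, not the
real kernel symbols; ESTABLISHED/FIN_WAIT numbers are what my conntrack
events show, CLOSE is just a placeholder): teardown picks FIN_WAIT or
CLOSE depending on whether we saw a FIN or a RST, and the timeout then
follows from the per-state table via nf_tcp_pernet() instead of being
forced to ESTABLISHED or UNACK.

/* toy model, compiles standalone; the real code lives in
 * nf_flow_table_core.c and nf_conntrack_proto_tcp.c
 */
#include <stdio.h>

enum tcp_conntrack {
	TCP_CONNTRACK_ESTABLISHED,
	TCP_CONNTRACK_FIN_WAIT,
	TCP_CONNTRACK_CLOSE,
	TCP_CONNTRACK_MAX,
};

/* stand-in per-state timeouts in seconds; the kernel reads the real
 * table from nf_tcp_pernet(net)->timeouts[]
 */
static const unsigned int tcp_timeouts[TCP_CONNTRACK_MAX] = {
	[TCP_CONNTRACK_ESTABLISHED]	= 86400,
	[TCP_CONNTRACK_FIN_WAIT]	= 120,
	[TCP_CONNTRACK_CLOSE]		= 10,
};

/* teardown: FIN -> FIN_WAIT, RST -> CLOSE, timeout follows the state */
static unsigned int teardown_timeout(int fin, enum tcp_conntrack *state)
{
	*state = fin ? TCP_CONNTRACK_FIN_WAIT : TCP_CONNTRACK_CLOSE;

	return tcp_timeouts[*state];
}

int main(void)
{
	enum tcp_conntrack state;

	printf("fin teardown -> %u seconds\n", teardown_timeout(1, &state));
	printf("rst teardown -> %u seconds\n", teardown_timeout(0, &state));

	return 0;
}

If I got that wrong, please correct me.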

Here is one example of a flow that is still in FIN_WAIT:

[NEW] tcp      6 120 SYN_SENT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 [UNREPLIED] src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
[UPDATE] tcp      6 60 SYN_RECV src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 mark=16777216
[UPDATE] tcp      6 86400 ESTABLISHED src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
[UPDATE] tcp      6 120 FIN_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [OFFLOAD] mark=16777216
[UPDATE] tcp      6 30 LAST_ACK src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
[UPDATE] tcp      6 120 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 [ASSURED] mark=16777216
[DESTROY] tcp      6 TIME_WAIT src=fd00::192:168:5:32 dst=2a05:d014:687:ed01::21 sport=58790 dport=443 packets=15 bytes=1750 src=2a05:d014:687:ed01::21 dst=2003:a:c7f:e5e8:a3ae:63f7:7f92:e286 sport=443 dport=60848 packets=13 bytes=6905 [ASSURED] mark=16777216 delta-time=120

Best
Sven




