On Wed, Jul 26, 2023 at 07:14:56AM -0700, Yan Zhai wrote: > On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote: > > I'm not positive I understand the code in ip_finish_output2(). I think > > instead of looking for LWTUNNEL_XMIT_DONE it should instead look for > > != LWTUNNEL_XMIT_CONTINUE. It's unfortunate that NET_XMIT_DROP and > > LWTUNNEL_XMIT_CONTINUE are the both 0x1. Why don't we just change that > > instead? > > > I considered about changing lwt side logic. But it would bring larger > impact since there are multiple types of encaps on this hook, not just > bpf redirect. Changing bpf return values is a minimum change on the > other hand. In addition, returning value of NET_RX_DROP and > NET_XMIT_CN are the same, so if we don't do something in bpf redirect, > there is no way to distinguish them later: the former is considered as > an error, while "CN" is considered as non-error. Uh, NET_RX/XMIT_DROP values are 1. NET_XMIT_CN is 2. I'm not an expert but I think what happens is that we treat NET_XMIT_CN as success so that it takes a while for the resend to happen. Eventually the TCP layer will detect it as a dropped packet. > > > Also there seems to be a leak in lwtunnel_xmit(). Should that return > > LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning? > > > > Something like the following? > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index 11652e464f5d..375790b672bc 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev); > > #define NET_XMIT_CN 0x02 /* congestion notification */ > > #define NET_XMIT_MASK 0x0f /* qdisc flags in net/sch_generic.h */ > > > > +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS > > +#define LWTUNNEL_XMIT_CONTINUE 0x3 > > + > > /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It > > * indicates that the device will soon be dropping packets, or already drops > > * some packets of the same priority; prompting us to send less aggressively. */ > > diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h > > index 6f15e6fa154e..8ab032ee04d0 100644 > > --- a/include/net/lwtunnel.h > > +++ b/include/net/lwtunnel.h > > @@ -16,12 +16,6 @@ > > #define LWTUNNEL_STATE_INPUT_REDIRECT BIT(1) > > #define LWTUNNEL_STATE_XMIT_REDIRECT BIT(2) > > > > -enum { > > - LWTUNNEL_XMIT_DONE, > > - LWTUNNEL_XMIT_CONTINUE, > > -}; > > - > > - > > struct lwtunnel_state { > > __u16 type; > > __u16 flags; > > diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c > > index 711cd3b4347a..732415d1287d 100644 > > --- a/net/core/lwtunnel.c > > +++ b/net/core/lwtunnel.c > > @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb) > > > > if (lwtstate->type == LWTUNNEL_ENCAP_NONE || > > lwtstate->type > LWTUNNEL_ENCAP_MAX) > > - return 0; > > + return LWTUNNEL_XMIT_CONTINUE; > > You are correct this path would leak skb. Return continue (or drop) > would avoid the leak. Personally I'd prefer drop instead to signal the > error setup. Since this is a separate issue, do you want to send a > separate patch on this? Or I am happy to do it if you prefer. > I don't know which makes sense so I'll leave that up to you. > > > > ret = -EOPNOTSUPP; > > rcu_read_lock(); > > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > > index 6e70839257f7..4be50a211b14 100644 > > --- a/net/ipv4/ip_output.c > > +++ b/net/ipv4/ip_output.c > > @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s > > if (lwtunnel_xmit_redirect(dst->lwtstate)) { > > int res = lwtunnel_xmit(skb); > > > > - if (res < 0 || res == LWTUNNEL_XMIT_DONE) > > + if (res != LWTUNNEL_XMIT_CONTINUE) > > return res; > > Unfortunately we cannot return res directly here when res > 0. This is > the final reason why I didn't patch here. Return values here can be > propagated back to sendmsg syscall, so returning a positive value > would break the syscall convention. The neigh_output() function is going to return NET_XMIT_DROP so this already happens. Is that not what we want to happen? I guess my concern is that eventually people will eventually new introduce bugs. Fixing incorrect error codes is something that I do several times per week. :P regards, dan carpenter