On Thu, Aug 24, 2023 at 10:39 PM +08, Liu Jian wrote: > If the sockmap msg redirection function is used only to forward packets > and no other operation, the execution result of the BPF_SK_MSG_VERDICT > program is the same each time. In this case, the BPF program only needs to > be run once. Add BPF_F_PERMANENT flag to bpf_msg_redirect_map() and > bpf_msg_redirect_hash() to implement this ability. > > Then we can enable this function in the bpf program as follows: > bpf_msg_redirect_hash(xx, xx, xx, BPF_F_INGRESS | BPF_F_PERMANENT); > > Test results using netperf TCP_STREAM mode: > for i in 1 64 128 512 1k 2k 32k 64k 100k 500k 1m;then > netperf -T 1,2 -t TCP_STREAM -H 127.0.0.1 -l 20 -- -m $i -s 100m,100m -S 100m,100m > done > > before: > 3.84 246.52 496.89 1885.03 3415.29 6375.03 40749.09 48764.40 51611.34 55678.26 55992.78 > after: > 4.43 279.20 555.82 2080.79 3870.70 7105.44 41836.41 49709.75 51861.56 55211.00 54566.85 > > Signed-off-by: Liu Jian <liujian56@xxxxxxxxxx> > --- > include/linux/skmsg.h | 1 + > include/uapi/linux/bpf.h | 15 +++++++++++++-- > net/core/skmsg.c | 5 +++++ > net/core/sock_map.c | 4 ++-- > net/ipv4/tcp_bpf.c | 18 +++++++++++++----- > tools/include/uapi/linux/bpf.h | 15 +++++++++++++-- > 6 files changed, 47 insertions(+), 11 deletions(-) > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > index 054d7911bfc9..2f4e9811ff85 100644 > --- a/include/linux/skmsg.h > +++ b/include/linux/skmsg.h > @@ -82,6 +82,7 @@ struct sk_psock { > u32 cork_bytes; > u32 eval; > bool redir_ingress; /* undefined if sk_redir is null */ > + bool redir_permanent; > struct sk_msg *cork; > struct sk_psock_progs progs; > #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER) > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 70da85200695..f4de1ba390b4 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3004,7 +3004,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). There are some grammar mistakes here we need to fix. Hint - I find it helpful to run the text through an online grammar checker or an AI chatbot when unsure. Either way, let's reword so the flags are clearly listed out, since we now have two of them: The following *flags* are supported: **BPF_F_INGRESS** Both ingress and egress interfaces can be used for redirection. The **BPF_F_INGRESS** value in *flags* is used to make the distinction. Ingress path is selected if the flag is present, egress path otherwise. **BPF_F_PERMANENT** Indicates that redirect verdict and the target socket should be remembered. The verdict program will not be run for subsequent packets, unless an error occurs when forwarding packets. **BPF_F_PERMANENT** cannot be use together with **bpf_msg_apply_bytes**\ () and **bpf_msg_cork_bytes**\ (). If either has been called, the **BPF_F_PERMANENT** flag is ignored. Please check the formatting is correct with: ./scripts/bpf_doc.py --filename include/uapi/linux/bpf.h \ | rst2man /dev/stdin | man /dev/stdin That said, I'm not sure I like these semantics - flag being ignored under some circumstances. It leads to a silent failure. If I've asked for a permanent redirect, but it is cannot work in my BPF program, I'd rather get an error from bpf_msg_redirect_map/hash(), than be surprised that it is getting executed for every packet and have to troubleshoot why. > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -3276,7 +3281,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -5872,6 +5882,7 @@ enum { > /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ > enum { > BPF_F_INGRESS = (1ULL << 0), > + BPF_F_PERMANENT = (1ULL << 1), > }; > > /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index a29508e1ff35..df1443cf5fbd 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -885,6 +885,11 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, > goto out; > } > psock->redir_ingress = sk_msg_to_ingress(msg); > + if (!msg->apply_bytes && !msg->cork_bytes) > + psock->redir_permanent = > + msg->flags & BPF_F_PERMANENT; > + else > + psock->redir_permanent = false; Above can be rewritten as: psock->redir_permanent = !msg->apply_bytes && !msg->cork_bytes && (msg->flags & BPF_F_PERMANENT); But as I wrote earlier, I don't think it's a good idea to ignore the flag. We can detect this conflict at the time the bpf_msg_sk_redirect_* helper is called and return an error. Naturally that means that that bpf_msg_{cork,apply}_bytes helpers need to be adjusted to return an error if BPF_F_PERMANENT has been set. > psock->sk_redir = msg->sk_redir; > sock_hold(psock->sk_redir); > } > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 08ab108206bf..35a361614f5e 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -662,7 +662,7 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg *, msg, > { > struct sock *sk; > > - if (unlikely(flags & ~(BPF_F_INGRESS))) > + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) > return SK_DROP; > > sk = __sock_map_lookup_elem(map, key); > @@ -1261,7 +1261,7 @@ BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg *, msg, > { > struct sock *sk; > > - if (unlikely(flags & ~(BPF_F_INGRESS))) > + if (unlikely(flags & ~(BPF_F_INGRESS | BPF_F_PERMANENT))) > return SK_DROP; > > sk = __sock_hash_lookup_elem(map, key); > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c > index 81f0dff69e0b..b53e356562a6 100644 > --- a/net/ipv4/tcp_bpf.c > +++ b/net/ipv4/tcp_bpf.c > @@ -419,8 +419,10 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > if (!psock->apply_bytes) { > /* Clean up before releasing the sock lock. */ > eval = psock->eval; > - psock->eval = __SK_NONE; > - psock->sk_redir = NULL; > + if (!psock->redir_permanent) { > + psock->eval = __SK_NONE; > + psock->sk_redir = NULL; > + } > } > if (psock->cork) { > cork = true; > @@ -433,9 +435,15 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress, > msg, tosend, flags); > sent = origsize - msg->sg.size; > + /* disable the ability when something wrong */ > + if (unlikely(ret < 0)) > + psock->redir_permanent = 0; > > - if (eval == __SK_REDIRECT) > + if (!psock->redir_permanent && eval == __SK_REDIRECT) { I believe eval == __SK_REDIRECT is always true here, and the eval local variable is redundant. We will be in this switch branch only if psock->eval == __SK_REDIRECT. It's something we missed during the review of cd9733f5d75c ("tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function"). > sock_put(sk_redir); > + psock->sk_redir = NULL; > + psock->eval = __SK_NONE; > + } > > lock_sock(sk); > if (unlikely(ret < 0)) { > @@ -460,8 +468,8 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, > } > > if (likely(!ret)) { > - if (!psock->apply_bytes) { > - psock->eval = __SK_NONE; > + if (!psock->apply_bytes && !psock->redir_permanent) { > + psock->eval = __SK_NONE; > if (psock->sk_redir) { > sock_put(psock->sk_redir); > psock->sk_redir = NULL; > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 70da85200695..f4de1ba390b4 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -3004,7 +3004,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -3276,7 +3281,12 @@ union bpf_attr { > * egress interfaces can be used for redirection. The > * **BPF_F_INGRESS** value in *flags* is used to make the > * distinction (ingress path is selected if the flag is present, > - * egress path otherwise). This is the only flag supported for now. > + * egress path otherwise). The **BPF_F_PERMANENT** value in > + * *flags* is used to indicates whether the eBPF result is > + * permanently (please note that, BPF_F_PERMANENT does not work with > + * msg_apply_bytes() and msg_cork_bytes(), if msg_apply_bytes() or > + * msg_cork_bytes() is configured, the BPF_F_PERMANENT function is > + * automatically disabled). > * Return > * **SK_PASS** on success, or **SK_DROP** on error. > * > @@ -5872,6 +5882,7 @@ enum { > /* BPF_FUNC_clone_redirect and BPF_FUNC_redirect flags. */ > enum { > BPF_F_INGRESS = (1ULL << 0), > + BPF_F_PERMANENT = (1ULL << 1), > }; > > /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */