On Thu, Aug 20, 2020 at 03:39:21PM -0700, Mat Martineau wrote: > > On Thu, 20 Aug 2020, Martin KaFai Lau wrote: > > > The bpf prog needs to parse the SYN header to learn what options have > > been sent by the peer's bpf-prog before writing its options into SYNACK. > > This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack(). > > This syn_skb will eventually be made available (as read-only) to the > > bpf prog. This will be the only SYN packet available to the bpf > > prog during syncookie. For other regular cases, the bpf prog can > > also use the saved_syn. > > > > When writing options, the bpf prog will first be called to tell the > > kernel its required number of bytes. It is done by the new > > bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new > > BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags. > > When the bpf prog returns, the kernel will know how many bytes are needed > > and then update the "*remaining" arg accordingly. 4 byte alignment will > > be included in the "*remaining" before this function returns. The 4 byte > > aligned number of bytes will also be stored into the opts->bpf_opt_len. > > "bpf_opt_len" is a newly added member to the struct tcp_out_options. > > > > Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the > > header options. The bpf prog is only called if it has reserved spaces > > before (opts->bpf_opt_len > 0). > > > > The bpf prog is the last one getting a chance to reserve header space > > and writing the header option. > > > > These two functions are half implemented to highlight the changes in > > TCP stack. The actual codes preparing the bpf running context and > > invoking the bpf prog will be added in the later patch with other > > necessary bpf pieces. > > > > Reviewed-by: Eric Dumazet <edumazet@xxxxxxxxxx> > > Signed-off-by: Martin KaFai Lau <kafai@xxxxxx> > > --- > > include/net/tcp.h | 6 +- > > include/uapi/linux/bpf.h | 3 +- > > net/ipv4/tcp_input.c | 5 +- > > net/ipv4/tcp_ipv4.c | 5 +- > > net/ipv4/tcp_output.c | 105 +++++++++++++++++++++++++++++---- > > net/ipv6/tcp_ipv6.c | 5 +- > > tools/include/uapi/linux/bpf.h | 3 +- > > 7 files changed, 109 insertions(+), 23 deletions(-) > > > > ... > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > > index 5084333b5ab6..631a5ee0dd4e 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > ... > > > @@ -826,6 +886,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb > > opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; > > } > > > > + if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp, > > + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) { > > + unsigned int remaining = MAX_TCP_OPTION_SPACE - size; > > + > > + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); > > + > > + size = MAX_TCP_OPTION_SPACE - remaining; > > + } > > + > > return size; > > } > > > > Since bpf_skops_hdr_opt_len() is called after the SACK code tries to use up > all remaining option space for SACK blocks, it's less likely that there will > be sufficient space remaining. Did you consider moving this hunk before the > SACK option space is allocated to give the bpf prog higher priority? No. Not at this point. It is unlike a well defined option (e.g. MPTCP) that may have a requirement on being presence in the header. For a generic usage in bpf, it is hard to judge if a bpf program is writing a higher priority option. Also considering SACK is a transient behavior, the bpf prog will eventually get its chance to write. The bpf prog should always assume there may not have enough space and ready to retry later.