Re: [PATCH v5 bpf-next 07/12] bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()

Martin KaFai Lau <kafai@xxxxxx> · Thu, 20 Aug 2020 16:16:20 -0700

On Thu, Aug 20, 2020 at 03:39:21PM -0700, Mat Martineau wrote:
> 
> On Thu, 20 Aug 2020, Martin KaFai Lau wrote:
> 
> > The bpf prog needs to parse the SYN header to learn what options have
> > been sent by the peer's bpf-prog before writing its options into SYNACK.
> > This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
> > This syn_skb will eventually be made available (as read-only) to the
> > bpf prog.  This will be the only SYN packet available to the bpf
> > prog during syncookie.  For other regular cases, the bpf prog can
> > also use the saved_syn.
> > 
> > When writing options, the bpf prog will first be called to tell the
> > kernel its required number of bytes.  It is done by the new
> > bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
> > BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
> > When the bpf prog returns, the kernel will know how many bytes are needed
> > and then update the "*remaining" arg accordingly.  4 byte alignment will
> > be included in the "*remaining" before this function returns.  The 4 byte
> > aligned number of bytes will also be stored into the opts->bpf_opt_len.
> > "bpf_opt_len" is a newly added member to the struct tcp_out_options.
> > 
> > Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
> > header options.  The bpf prog is only called if it has reserved spaces
> > before (opts->bpf_opt_len > 0).
> > 
> > The bpf prog is the last one getting a chance to reserve header space
> > and writing the header option.
> > 
> > These two functions are half implemented to highlight the changes in
> > TCP stack.  The actual codes preparing the bpf running context and
> > invoking the bpf prog will be added in the later patch with other
> > necessary bpf pieces.
> > 
> > Reviewed-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> > Signed-off-by: Martin KaFai Lau <kafai@xxxxxx>
> > ---
> > include/net/tcp.h              |   6 +-
> > include/uapi/linux/bpf.h       |   3 +-
> > net/ipv4/tcp_input.c           |   5 +-
> > net/ipv4/tcp_ipv4.c            |   5 +-
> > net/ipv4/tcp_output.c          | 105 +++++++++++++++++++++++++++++----
> > net/ipv6/tcp_ipv6.c            |   5 +-
> > tools/include/uapi/linux/bpf.h |   3 +-
> > 7 files changed, 109 insertions(+), 23 deletions(-)
> > 
> 
> ...
> 
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 5084333b5ab6..631a5ee0dd4e 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> 
> ...
> 
> > @@ -826,6 +886,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> > 			opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
> > 	}
> > 
> > +	if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp,
> > +					    BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) {
> > +		unsigned int remaining = MAX_TCP_OPTION_SPACE - size;
> > +
> > +		bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining);
> > +
> > +		size = MAX_TCP_OPTION_SPACE - remaining;
> > +	}
> > +
> > 	return size;
> > }
> > 
> 
> Since bpf_skops_hdr_opt_len() is called after the SACK code tries to use up
> all remaining option space for SACK blocks, it's less likely that there will
> be sufficient space remaining. Did you consider moving this hunk before the
> SACK option space is allocated to give the bpf prog higher priority?
No.  Not at this point.

It is unlike a well defined option (e.g. MPTCP) that may have a
requirement on being presence in the header.  For a generic usage in bpf,
it is hard to judge if a bpf program is writing a higher priority option.
Also considering SACK is a transient behavior, the bpf prog will eventually
get its chance to write.  The bpf prog should always assume there may not have
enough space and ready to retry later.