Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Wed, 28 Aug 2024 18:00:38 -0700

On 8/28/24 4:01 PM, Zijian Zhang wrote:
On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
On 8/26/24 6:37 PM, zijianzhang@xxxxxxxxxxxxx wrote:
From: Amery Hung <amery.hung@xxxxxxxxxxxxx>

This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().

Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the total
header size.

skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len of the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.

To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time. 

iiuc, it is a bug in the bpf prog itself that did not reserve the same header 
length and caused a drop. It is not the only drop case though for an incorrect 
bpf prog. There are other cases where a bpf prog can accidentally drop a packet.

Do you have an actual use case that the bpf prog cannot reserve the correct 
header length for the same sk ?

That's right, it's the bug of the bpf prog itself. We are trying to have
the error reported earlier in eBPF program, instead of successfully
returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
at the end because of it.

By adding this patch, the `remaining` variable passed to the
bpf_skops_hdr_opt_len will be more precise, it takes the previously
reserved size into account. As a result, if users accidentally set an
option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
will return -ENOSPC instead of 0.

Putting aside it adds more checks, this adds something pretty unique to the bpf 
header option comparing with other dynamic options like sack. e.g. For 
tp->mss_cache, I assume it won't change since the earlier tcp_current_mss() was 
called?

We have a use case where we add options to some packets kind of randomly
for the purpose of sampling, and accidentally set a larger option size
than the reserved size. It is the problem of ourselves and takes us
some effort to troubleshoot the root cause.

If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
could be helpful for users to avoid this mistake.

The bpf_sk_storage can be used to decide if a sk has been sampled.

Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in this 
patch can be done in the bpf prog itself.