On 8/28/24 4:01 PM, Zijian Zhang wrote:
On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
On 8/26/24 6:37 PM, zijianzhang@xxxxxxxxxxxxx wrote:
From: Amery Hung <amery.hung@xxxxxxxxxxxxx>
This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().
Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the total
header size.
skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len of the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.
To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time.
iiuc, it is a bug in the bpf prog itself that did not reserve the same header
length and caused a drop. It is not the only drop case though for an incorrect
bpf prog. There are other cases where a bpf prog can accidentally drop a packet.
Do you have an actual use case that the bpf prog cannot reserve the correct
header length for the same sk ?
That's right, it's the bug of the bpf prog itself. We are trying to have
the error reported earlier in eBPF program, instead of successfully
returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
at the end because of it.
By adding this patch, the `remaining` variable passed to the
bpf_skops_hdr_opt_len will be more precise, it takes the previously
reserved size into account. As a result, if users accidentally set an
option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
will return -ENOSPC instead of 0.
Putting aside it adds more checks, this adds something pretty unique to the bpf
header option comparing with other dynamic options like sack. e.g. For
tp->mss_cache, I assume it won't change since the earlier tcp_current_mss() was
called?
We have a use case where we add options to some packets kind of randomly
for the purpose of sampling, and accidentally set a larger option size
than the reserved size. It is the problem of ourselves and takes us
some effort to troubleshoot the root cause.
If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
could be helpful for users to avoid this mistake.
The bpf_sk_storage can be used to decide if a sk has been sampled.
Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in this
patch can be done in the bpf prog itself.