On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
On 8/28/24 4:01 PM, Zijian Zhang wrote:
On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
On 8/26/24 6:37 PM, zijianzhang@xxxxxxxxxxxxx wrote:
From: Amery Hung <amery.hung@xxxxxxxxxxxxx>
This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().
Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be
called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in
__tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the total
header size.
skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len of
the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.
To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time.
iiuc, it is a bug in the bpf prog itself that did not reserve the
same header length and caused a drop. It is not the only drop case
though for an incorrect bpf prog. There are other cases where a bpf
prog can accidentally drop a packet.
Do you have an actual use case that the bpf prog cannot reserve the
correct header length for the same sk ?
That's right, it's the bug of the bpf prog itself. We are trying to have
the error reported earlier in eBPF program, instead of successfully
returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
at the end because of it.
By adding this patch, the `remaining` variable passed to the
bpf_skops_hdr_opt_len will be more precise, it takes the previously
reserved size into account. As a result, if users accidentally set an
option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
will return -ENOSPC instead of 0.
Putting aside it adds more checks, this adds something pretty unique to
the bpf header option comparing with other dynamic options like sack.
e.g. For tp->mss_cache, I assume it won't change since the earlier
tcp_current_mss() was called?
Agreed, this check is very unique comparing with other dynamic options.
How about moving the check into function bpf_skops_hdr_opt_len? It
already has some logic to check if the context is in tcp_current_mss.
Does it look more reasonable?
`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
For tp->mss_cache here, yes, I also think it won't change,
I am trying to get the reserved bpf hdr size. Considering other dynamic
options, this equation is not precise, and may change it to
`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb) - size`?
Or, not sure if adding a field in tcp_sock to precisely cache the
reserved bpf hdr size is a good idea?
We have a use case where we add options to some packets kind of randomly
for the purpose of sampling, and accidentally set a larger option size
than the reserved size. It is the problem of ourselves and takes us
some effort to troubleshoot the root cause.
If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
could be helpful for users to avoid this mistake.
The bpf_sk_storage can be used to decide if a sk has been sampled.
Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks in
this patch can be done in the bpf prog itself.
Thanks for pointing this out! Agreed, the check can be implemented in
eBPF progs.