Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU

Martin KaFai Lau <martin.lau@xxxxxxxxx> · Thu, 5 Sep 2024 12:38:09 -0700

On 9/5/24 11:19 AM, Zijian Zhang wrote:
On 9/3/24 3:38 PM, Martin KaFai Lau wrote:
On 8/30/24 2:02 PM, Zijian Zhang wrote:
On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
On 8/28/24 4:01 PM, Zijian Zhang wrote:
On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
On 8/26/24 6:37 PM, zijianzhang@xxxxxxxxxxxxx wrote:
From: Amery Hung <amery.hung@xxxxxxxxxxxxx>

This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().

Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in __tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the total
header size.

skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len of the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.

To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time. 

iiuc, it is a bug in the bpf prog itself that did not reserve the same 
header length and caused a drop. It is not the only drop case though for 
an incorrect bpf prog. There are other cases where a bpf prog can 
accidentally drop a packet.

Do you have an actual use case that the bpf prog cannot reserve the 
correct header length for the same sk ?

That's right, it's the bug of the bpf prog itself. We are trying to have
the error reported earlier in eBPF program, instead of successfully
returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
at the end because of it.

By adding this patch, the `remaining` variable passed to the
bpf_skops_hdr_opt_len will be more precise, it takes the previously
reserved size into account. As a result, if users accidentally set an
option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
will return -ENOSPC instead of 0.

Putting aside it adds more checks, this adds something pretty unique to the 
bpf header option comparing with other dynamic options like sack. e.g. For 
tp- >mss_cache, I assume it won't change since the earlier tcp_current_mss() 
was called?

Agreed, this check is very unique comparing with other dynamic options.
How about moving the check into function bpf_skops_hdr_opt_len? It
already has some logic to check if the context is in tcp_current_mss.
Does it look more reasonable?

Yes, it probably may be better to put that into the bpf_skops_hdr_opt_len(). 
This is implementation details which is something for later after figuring out 
if the reserved_opt_spc change is correct and needed.

`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
For tp->mss_cache here, yes, I also think it won't change,

This needs more details and clear explanation on why it is the case and why 
the existing regular bpf prog will continue to work.

afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same between 
calls for this to work . From thinking about it, it should be but I haven't 
looked at all cases. Missing "- size" does not help the confidence here also.

Also, when "skb != NULL", I don't understand why the "MAX_TCP_OPTION_SPACE - 
size" is still being considered. I am likely missing something here. If the 
above formula is correct, why reserved_opt_spc is not always used for the 
"skb != NULL" case and still need to be compared with the 
"MAX_TCP_OPTION_SPACE - size"?

Cases I can think of are as follows,
- When it's not a GSO skb, tcp_skb_seglen will simply return skb->len,
it might make `tp->mss_cache - tcp_skb_seglen(skb)` a large number.

- When we are in tcp_mtu_probe, tp->mss_cache will be smaller than
tcp_skb_seglen(skb), which makes the equation again a large number.

In tcp_mtu_probe, mss_cache is (smaller) than the tcp_skb_seglen(skb)?

"MAX_TCP_OPTION_SPACE - size" is considered here as the strict upper
bound, while reserved_opt_spc is expected to be a stricter upper bound.
In the above or other cases, where the equation malfunctioned, we can
always fall back to the original bound.

Make sense. Thanks for the explanation. The commit message could have used this 
details.

I am not sure which way to get the reserved size is better,
1. We could precisely cache the result of the reserved size, may
need to have a new field for it, I agree that a field in tcp_sock
is overkill.

2. In this patch, we use an equation to infer the value of it. There are
some concerns with tp->mss_cache and tcp_skb_seglen(skb) here, I agree.
If tp->mss_cache and tcp_skb_seglen(skb) might not be in-sync, we may
have an underestimated reserved_opt_spc, it may break existing bpf
progs. If this method is preferred I will do more investigations to
verify it or modify the equation :)

imo, the bpf prog can always use bpf_sk_storage to add fields to a sock and 
store the previous reserved space there. I think this is the solution you should 
use in your use case which randomly reserves header space. Adding a field in the 
tcp_sock feels wrong especially the only use case I am hearing so far is a bpf 
prog randomly reserving header spaces.

If (2) can be convinced to be correct, improving bpf_reserve_hdr_opt() is fine 
as long as it does not break the existing program. I expect some tests to cover 
this.