Re: [PATCH bpf-next v2 1/2] bpf: tcp: prevent bpf_reserve_hdr_opt() from growing skb larger than MTU

Zijian Zhang <zijianzhang@xxxxxxxxxxxxx> · Thu, 5 Sep 2024 11:19:31 -0700

On 9/3/24 3:38 PM, Martin KaFai Lau wrote:
On 8/30/24 2:02 PM, Zijian Zhang wrote:
On 8/28/24 6:00 PM, Martin KaFai Lau wrote:
On 8/28/24 4:01 PM, Zijian Zhang wrote:
On 8/28/24 2:29 PM, Martin KaFai Lau wrote:
On 8/26/24 6:37 PM, zijianzhang@xxxxxxxxxxxxx wrote:
From: Amery Hung <amery.hung@xxxxxxxxxxxxx>

This series prevents sockops users from accidentally causing packet
drops. This can happen when a BPF_SOCK_OPS_HDR_OPT_LEN_CB program
reserves different option lengths in tcp_sendmsg().

Initially, sockops BPF_SOCK_OPS_HDR_OPT_LEN_CB program will be 
called to
reserve a space in tcp_send_mss(), which will return the MSS for TSO.
Then, BPF_SOCK_OPS_HDR_OPT_LEN_CB will be called in 
__tcp_transmit_skb()
again to calculate the actual tcp_option_size and skb_push() the 
total
header size.

skb->gso_size is restored from TCP_SKB_CB(skb)->tcp_gso_size, 
which is
derived from tcp_send_mss() where we first call HDR_OPT_LEN. If the
reserved opt size is smaller than the actual header size, the len 
of the
skb can exceed the MTU. As a result, ip(6)_fragment will drop the
packet if skb->ignore_df is not set.

To prevent this accidental packet drop, we need to make sure the
second call to the BPF_SOCK_OPS_HDR_OPT_LEN_CB program reserves space
not more than the first time. 

iiuc, it is a bug in the bpf prog itself that did not reserve the 
same header length and caused a drop. It is not the only drop case 
though for an incorrect bpf prog. There are other cases where a bpf 
prog can accidentally drop a packet.

Do you have an actual use case that the bpf prog cannot reserve the 
correct header length for the same sk ?

That's right, it's the bug of the bpf prog itself. We are trying to 
have
the error reported earlier in eBPF program, instead of successfully
returning from bpf_sock_ops_reserve_hdr_opt but leading to packet drop
at the end because of it.

By adding this patch, the `remaining` variable passed to the
bpf_skops_hdr_opt_len will be more precise, it takes the previously
reserved size into account. As a result, if users accidentally set an
option size larger than the reserved size, bpf_sock_ops_reserve_hdr_opt
will return -ENOSPC instead of 0.

Putting aside it adds more checks, this adds something pretty unique 
to the bpf header option comparing with other dynamic options like 
sack. e.g. For tp- >mss_cache, I assume it won't change since the 
earlier tcp_current_mss() was called?

Agreed, this check is very unique comparing with other dynamic options.
How about moving the check into function bpf_skops_hdr_opt_len? It
already has some logic to check if the context is in tcp_current_mss.
Does it look more reasonable?

Yes, it probably may be better to put that into the 
bpf_skops_hdr_opt_len(). This is implementation details which is 
something for later after figuring out if the reserved_opt_spc change is 
correct and needed.

`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb)`
For tp->mss_cache here, yes, I also think it won't change,

This needs more details and clear explanation on why it is the case and 
why the existing regular bpf prog will continue to work.

afaik, mss_cache and tcp_skb_seglen() must be in-sync and the same 
between calls for this to work . From thinking about it, it should be 
but I haven't looked at all cases. Missing "- size" does not help the 
confidence here also.

Also, when "skb != NULL", I don't understand why the 
"MAX_TCP_OPTION_SPACE - size" is still being considered. I am likely 
missing something here. If the above formula is correct, why 
reserved_opt_spc is not always used for the "skb != NULL" case and still 
need to be compared with the "MAX_TCP_OPTION_SPACE - size"?

Cases I can think of are as follows,
- When it's not a GSO skb, tcp_skb_seglen will simply return skb->len,
it might make `tp->mss_cache - tcp_skb_seglen(skb)` a large number.

- When we are in tcp_mtu_probe, tp->mss_cache will be smaller than
tcp_skb_seglen(skb), which makes the equation again a large number.

"MAX_TCP_OPTION_SPACE - size" is considered here as the strict upper
bound, while reserved_opt_spc is expected to be a stricter upper bound.
In the above or other cases, where the equation malfunctioned, we can
always fall back to the original bound.

I am not sure which way to get the reserved size is better,
1. We could precisely cache the result of the reserved size, may
need to have a new field for it, I agree that a field in tcp_sock
is overkill.

2. In this patch, we use an equation to infer the value of it. There are
some concerns with tp->mss_cache and tcp_skb_seglen(skb) here, I agree.
If tp->mss_cache and tcp_skb_seglen(skb) might not be in-sync, we may
have an underestimated reserved_opt_spc, it may break existing bpf
progs. If this method is preferred I will do more investigations to
verify it or modify the equation :)

Do you have preference to either option?

I am trying to get the reserved bpf hdr size. Considering other dynamic
options, this equation is not precise, and may change it to
`reserved_opt_spc = tp->mss_cache - tcp_skb_seglen(skb) - size`?

"- size" is needed. The test did not catch it?

Or, not sure if adding a field in tcp_sock to precisely cache the
reserved bpf hdr size is a good idea?

imo, adding one field in tcp_sock for this is overkill.

It seems your bpf prog is randomly reserving space. If this is the case, 
giving a fail signal in bpf_reserve_hdr_opt() does not improve the 
success rate for your random bpf prog to reserve header. I don't think 
adding a field or the change in this patch really solves anything in 
your randomly reserving space use case ?

It's true that it won't help with the success rate, but it could help
save packets from being dropped.

The goal is to let bpf_reserve_hdr_opt fail, when it is supposed to.
And, as a bonus effect, it could also save packets from being dropped.

When the accidental mistake happens, we want bpf_reserve_hdr_opt to
fail, so that `sock_ops.remaining_opt_len` won't be updated.
If it is still updated in this case, the packet will be dropped in
ip_fragment because `IPCB(skb)->frag_max_size > mtu`.

We have a use case where we add options to some packets kind of 
randomly
for the purpose of sampling, and accidentally set a larger option size
than the reserved size. It is the problem of ourselves and takes us
some effort to troubleshoot the root cause.

If bpf_sock_ops_reserve_hdr_opt can return an error in this case, it
could be helpful for users to avoid this mistake.

The bpf_sk_storage can be used to decide if a sk has been sampled.

Also, with bpf_cast_to_kern_ctx and bpf_rdonly_cast, all the checks 
in this patch can be done in the bpf prog itself.

Thanks for pointing this out! Agreed, the check can be implemented in
eBPF progs.