On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:

[...]

>> If it's not the BPF prog, which you have ruled out, then where are we
>> burning cycles? Maybe that is something that can be improved.
>> Also, in terms of quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>>
>
> Pressure measurement method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
>    79.53%--tcp_sendmsg
>       42.48%--tcp_sendmsg_locked
>          16.23%--_copy_from_iter
>           4.24%--tcp_send_mss
>              3.25%--tcp_current_mss
>
> perf top -C 8
>
> 19.13% [kernel]            [k] _raw_spin_lock_bh
> 11.75% [kernel]            [k] copy_user_enhanced_fast_string
>  9.86% [kernel]            [k] tcp_sendmsg_locked
>  4.44% sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  4.16% libpthread-2.28.so  [.] __libc_sendto
>  3.85% [kernel]            [k] syscall_return_via_sysret
>  2.70% [kernel]            [k] _copy_from_iter
>  2.48% [kernel]            [k] entry_SYSCALL_64
>  2.33% [kernel]            [k] native_queued_spin_lock_slowpath
>  1.89% [kernel]            [k] __virt_addr_valid
>  1.77% [kernel]            [k] __check_object_size
>  1.75% [kernel]            [k] __sys_sendto
>  1.74% [kernel]            [k] entry_SYSCALL_64_after_hwframe
>  1.42% [kernel]            [k] __fget_light
>  1.28% [kernel]            [k] tcp_push
>  1.01% [kernel]            [k] tcp_established_options
>  0.97% [kernel]            [k] tcp_send_mss
>  0.94% [kernel]            [k] syscall_exit_to_user_mode_prepare
>  0.94% [kernel]            [k] tcp_sendmsg
>  0.86% [kernel]            [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
>    79.02%--tcp_sendmsg
>       54.14%--tcp_sendmsg_locked
>          12.82%--_copy_from_iter
>          12.51%--tcp_send_mss
>             11.77%--tcp_current_mss
>                10.10%--tcp_established_options
>                    8.75%--bpf_skops_hdr_opt_len.isra.54
>                       5.71%--__cgroup_bpf_run_filter_sock_ops
>                          3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>           6.61%--__tcp_push_pending_frames
>              6.60%--tcp_write_xmit
>                 5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98% [kernel]            [k] _raw_spin_lock_bh
>  9.04% [kernel]            [k] copy_user_enhanced_fast_string
>  7.78% [kernel]            [k] tcp_sendmsg_locked
>  3.91% sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  3.46% libpthread-2.28.so  [.] __libc_sendto
>  3.35% [kernel]            [k] syscall_return_via_sysret
>  2.86% [kernel]            [k] bpf_skops_hdr_opt_len.isra.54
>  2.16% [kernel]            [k] __htab_map_lookup_elem
>  2.11% [kernel]            [k] _copy_from_iter
>  2.09% [kernel]            [k] entry_SYSCALL_64
>  1.97% [kernel]            [k] __virt_addr_valid
>  1.95% [kernel]            [k] __cgroup_bpf_run_filter_sock_ops
>  1.95% [kernel]            [k] lookup_nulls_elem_raw
>  1.89% [kernel]            [k] __fget_light
>  1.42% [kernel]            [k] __sys_sendto
>  1.41% [kernel]            [k] entry_SYSCALL_64_after_hwframe
>  1.31% [kernel]            [k] native_queued_spin_lock_slowpath
>  1.22% [kernel]            [k] __check_object_size
>  1.18% [kernel]            [k] tcp_established_options
>  1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Comparing the above test results with one CPU fully loaded, you can see
> that the upper limit of qps or BandWidth drops by nearly 18-20%. Then,
> looking at CPU occupancy, you can see that "tcp_send_mss" has increased
> significantly.

This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there is any low-hanging
fruit there which we can optimize.

For instance, when I benchmark it in a VM, I see we're spending cycles
mostly in memset()/rep stos.

I have no idea where the cycles are spent in your case.

>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>  	 * options first before the BPF program does.
>>>>>  	 */
>>>>>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> +	/* Fast path to reserve space in a skb under
>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> +	 * opt length doesn't change often, so it can save in the tcp_sock. And
>>>>> +	 * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
>>>>> +	 */
>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>>
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage to show this API extension in action
>>>> would help.
>>>>
>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to avoid
>>> the bpf prog being triggered frequently before TSO, by providing a way
>>> for users to not trigger the bpf prog when the opt len is unchanged.
>>> Then, when writing the opt, if the len changes, clear the flag and
>>> change the opt len with the next packet.
>>
>> I haven't seen a sample using the API extension that you're proposing,
>> so I can only guess.
>> But you probably have something like:
>>
>>   SEC("sockops")
>>   int sockops_prog(struct bpf_sock_ops *ctx)
>>   {
>>           if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>               ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                   bpf_reserve_hdr_opt(ctx, N, 0);
>>                   bpf_sock_ops_cb_flags_set(ctx,
>>                                             ctx->bpf_sock_ops_cb_flags |
>>                                             MY_NEW_FLAG);
>>                   return 1;
>>           }
>>   }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>>
>>   int sockops_prog(struct bpf_sock_ops *ctx)
>>   {
>>           if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>               ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                   bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>                   return 1;
>>           }
>>   }
>
> "bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);"
>
> I don't know what I could do with the flag parameter - make
> "bpf_reserve_hdr_opt" return quickly? That would not help, because the
> expensive part is triggering the bpf prog at all, and it still sits on
> the hot path of sending packets, before TSO has been done.
>
>> [...]

This is not what I'm suggesting. bpf_reserve_hdr_opt() has access to
bpf_sock_ops_kern and even the sock.

You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again, or even
configure the tcp_sock directly from bpf_reserve_hdr_opt(), because it
has access to it via bpf_sock_ops_kern{}.sk.
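
To make the second option concrete, here is a rough, untested sketch of
what I mean. BPF_F_CACHE_HDR_OPT_LEN and tp->bpf_cached_hdr_opt_len below
are made-up names, only there to show the shape of it:

  /* net/core/filter.c, inside bpf_sock_ops_reserve_hdr_opt() */

	/* BPF_F_CACHE_HDR_OPT_LEN would be a new helper flag. */
	if (flags & BPF_F_CACHE_HDR_OPT_LEN) {
		struct sock *sk = bpf_sock->sk;

		/* Remember the reserved length on the tcp_sock so that
		 * subsequent skbs can reuse it without calling back into
		 * the sockops program.
		 */
		if (sk_fullsock(sk))
			tcp_sk(sk)->bpf_cached_hdr_opt_len = len;
	}

  /* net/ipv4/tcp_output.c, early in bpf_skops_hdr_opt_len(),
   * established-socket path only, option alignment handling omitted:
   */
	if (tcp_sk(sk)->bpf_cached_hdr_opt_len) {
		/* Reuse the cached length and skip the BPF program call. */
		opts->bpf_opt_len = tcp_sk(sk)->bpf_cached_hdr_opt_len;
		*remaining -= opts->bpf_opt_len;
		return;
	}

The program would then only be consulted again once the cached value is
cleared, for example from the BPF_SOCK_OPS_WRITE_HDR_OPT_CB path when the
program notices that the option length has changed.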