On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:
> On 2024/5/17 01:15, Jakub Sitnicki wrote:
>> On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:
>>> On 2024/5/15 17:48, Jakub Sitnicki wrote:

[...]

>> If it's not the BPF prog, which you have ruled out, then where are we
>> burning cycles? Maybe that is something that can be improved.
>> Also, in terms of quantifying the improvement - it is 20% in terms of
>> what? Throughput, pps, cycles? And was that a single data point? For
>> multiple measurements there must be some variance (+/- X pp).
>> Would be great to see some data to back it up.
>> [...]
>>
>
> Pressure measurement method:
>
> server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
> client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
>
> Default mode, no bpf prog:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 71520808 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2384000 [msg/sec]
> sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.88%--sock_sendmsg
>    79.53%--tcp_sendmsg
>       42.48%--tcp_sendmsg_locked
>          16.23%--_copy_from_iter
>           4.24%--tcp_send_mss
>              3.25%--tcp_current_mss
>
> perf top -C 8
>
> 19.13% [kernel]            [k] _raw_spin_lock_bh
> 11.75% [kernel]            [k] copy_user_enhanced_fast_string
>  9.86% [kernel]            [k] tcp_sendmsg_locked
>  4.44% sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  4.16% libpthread-2.28.so  [.] __libc_sendto
>  3.85% [kernel]            [k] syscall_return_via_sysret
>  2.70% [kernel]            [k] _copy_from_iter
>  2.48% [kernel]            [k] entry_SYSCALL_64
>  2.33% [kernel]            [k] native_queued_spin_lock_slowpath
>  1.89% [kernel]            [k] __virt_addr_valid
>  1.77% [kernel]            [k] __check_object_size
>  1.75% [kernel]            [k] __sys_sendto
>  1.74% [kernel]            [k] entry_SYSCALL_64_after_hwframe
>  1.42% [kernel]            [k] __fget_light
>  1.28% [kernel]            [k] tcp_push
>  1.01% [kernel]            [k] tcp_established_options
>  0.97% [kernel]            [k] tcp_send_mss
>  0.94% [kernel]            [k] syscall_exit_to_user_mode_prepare
>  0.94% [kernel]            [k] tcp_sendmsg
>  0.86% [kernel]            [k] tcp_current_mss
>
> Having bpf prog to write tcp opt in all pkts:
>
> taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
> sockperf: == version #3.10-23.gited92afb185e6 ==
> sockperf[CLIENT] send on:
> [ 0] IP = x.x.x.x PORT = 7654 # TCP
> sockperf: Warmup stage (sending a few dummy messages)...
> sockperf: Starting test...
> sockperf: Test end (interrupted by timer)
> sockperf: Test ended
> sockperf: Total of 60636218 messages sent in 30.000 sec
>
> sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
> throughput consider using --msg-size=1472
> sockperf: Summary: Message Rate is 2021185 [msg/sec]
> sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)
>
> perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
> perf report
>
> 80.30%--sock_sendmsg
>    79.02%--tcp_sendmsg
>       54.14%--tcp_sendmsg_locked
>          12.82%--_copy_from_iter
>          12.51%--tcp_send_mss
>             11.77%--tcp_current_mss
>                10.10%--tcp_established_options
>                    8.75%--bpf_skops_hdr_opt_len.isra.54
>                       5.71%--__cgroup_bpf_run_filter_sock_ops
>                          3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
>           6.61%--__tcp_push_pending_frames
>              6.60%--tcp_write_xmit
>                 5.89%--__tcp_transmit_skb
>
> perf top -C 8
>
> 10.98% [kernel]            [k] _raw_spin_lock_bh
>  9.04% [kernel]            [k] copy_user_enhanced_fast_string
>  7.78% [kernel]            [k] tcp_sendmsg_locked
>  3.91% sockperf            [.] _Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
>  3.46% libpthread-2.28.so  [.] __libc_sendto
>  3.35% [kernel]            [k] syscall_return_via_sysret
>  2.86% [kernel]            [k] bpf_skops_hdr_opt_len.isra.54
>  2.16% [kernel]            [k] __htab_map_lookup_elem
>  2.11% [kernel]            [k] _copy_from_iter
>  2.09% [kernel]            [k] entry_SYSCALL_64
>  1.97% [kernel]            [k] __virt_addr_valid
>  1.95% [kernel]            [k] __cgroup_bpf_run_filter_sock_ops
>  1.95% [kernel]            [k] lookup_nulls_elem_raw
>  1.89% [kernel]            [k] __fget_light
>  1.42% [kernel]            [k] __sys_sendto
>  1.41% [kernel]            [k] entry_SYSCALL_64_after_hwframe
>  1.31% [kernel]            [k] native_queued_spin_lock_slowpath
>  1.22% [kernel]            [k] __check_object_size
>  1.18% [kernel]            [k] tcp_established_options
>  1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt
>
> Comparing the above test results with one CPU fully loaded, you can see
> that the upper limit of qps or BandWidth drops by nearly 18-20%. Then,
> looking at CPU occupancy, you can see that "tcp_send_mss" has increased
> significantly.

This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there is any low-hanging
fruit there which we can optimize.

For instance, when I benchmark it in a VM, I see we're spending cycles
mostly in memset()/rep stos.

I have no idea where the cycles are spent in your case.

>
>>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>>> index 90706a47f6ff..f2092de1f432 100644
>>>>> --- a/tools/include/uapi/linux/bpf.h
>>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>>> @@ -6892,8 +6892,14 @@ enum {
>>>>>  	 * options first before the BPF program does.
>>>>>  	 */
>>>>>  	BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
>>>>> +	/* Fast path to reserve space in a skb under
>>>>> +	 * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
>>>>> +	 * opt length doesn't change often, so it can save in the tcp_sock. And
>>>>> +	 * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
>>>>> +	 */
>>>>> +	BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),
>>>>
>>>> Have you considered a bpf_reserve_hdr_opt() flag instead?
>>>> An example or test coverage to show this API extension in action
>>>> would help.
>>>>
>>> A bpf_reserve_hdr_opt() flag can't accomplish this. I want to avoid
>>> the bpf prog being triggered frequently before TSO, by providing a way
>>> for users to not trigger the bpf prog when the opt len is unchanged.
>>> Then, when writing the opt, if the len changes, clear the flag and
>>> change the opt len with the next packet.
>>
>> I haven't seen a sample using the API extension that you're proposing,
>> so I can only guess.
>> But you probably have something like:
>>
>>   SEC("sockops")
>>   int sockops_prog(struct bpf_sock_ops *ctx)
>>   {
>>           if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>               ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                   bpf_reserve_hdr_opt(ctx, N, 0);
>>                   bpf_sock_ops_cb_flags_set(ctx,
>>                                             ctx->bpf_sock_ops_cb_flags |
>>                                             MY_NEW_FLAG);
>>                   return 1;
>>           }
>>   }
>
> Yes, that's what I expected.
>
>> I don't understand why you're saying it can't be transformed into:
>>
>>   int sockops_prog(struct bpf_sock_ops *ctx)
>>   {
>>           if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
>>               ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
>>                   bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
>>                   return 1;
>>           }
>>   }
>
> "bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);"
>
> I don't know what I could do with the flag parameter - make
> "bpf_reserve_hdr_opt" return quickly? That would not help, because the
> expensive part is triggering the bpf prog at all, and it still sits on
> the hot path of sending packets, before TSO has been done.
>
>> [...]

This is not what I'm suggesting. bpf_reserve_hdr_opt() has access to
bpf_sock_ops_kern and even the sock.

You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again, or even
configure the tcp_sock directly from bpf_reserve_hdr_opt(), because it
has access to it via bpf_sock_ops_kern{}.sk.
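
To make the second option concrete, here is a rough, untested sketch of
what I mean. BPF_F_CACHE_HDR_OPT_LEN and tp->bpf_cached_hdr_opt_len below
are made-up names, only there to show the shape of it:

  /* net/core/filter.c, inside bpf_sock_ops_reserve_hdr_opt() */

	/* BPF_F_CACHE_HDR_OPT_LEN would be a new helper flag. */
	if (flags & BPF_F_CACHE_HDR_OPT_LEN) {
		struct sock *sk = bpf_sock->sk;

		/* Remember the reserved length on the tcp_sock so that
		 * subsequent skbs can reuse it without calling back into
		 * the sockops program.
		 */
		if (sk_fullsock(sk))
			tcp_sk(sk)->bpf_cached_hdr_opt_len = len;
	}

  /* net/ipv4/tcp_output.c, early in bpf_skops_hdr_opt_len(),
   * established-socket path only, option alignment handling omitted:
   */
	if (tcp_sk(sk)->bpf_cached_hdr_opt_len) {
		/* Reuse the cached length and skip the BPF program call. */
		opts->bpf_opt_len = tcp_sk(sk)->bpf_cached_hdr_opt_len;
		*remaining -= opts->bpf_opt_len;
		return;
	}

The program would then only be consulted again once the cached value is
cleared, for example from the BPF_SOCK_OPS_WRITE_HDR_OPT_CB path when the
program notices that the option length has changed.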