On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
> On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
> > Hi, Ilya,
> > 
> > BPF CI ([1]) detected an s390x failure when the bpf program is
> > compiled with the latest llvm17 on the bpf-next branch. To reproduce
> > the issue, just run the normal './test_progs -j'. The failure log
> > looks like this:
> > 
> > Notice: Success: 341/3015, Skipped: 29, Failed: 1
> > Error: #191 sock_fields
> > Error: #191 sock_fields
> > create_netns:PASS:create netns 0 nsec
> > create_netns:PASS:bring up lo 0 nsec
> > serial_test_sock_fields:PASS:test_sock_fields__open_and_load 0 nsec
> > serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock_fields) 0 nsec
> > serial_test_sock_fields:PASS:attach_cgroup(ingress_read_sock_fields) 0 nsec
> > serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port 0 nsec
> > test:PASS:getsockname(listen_fd) 0 nsec
> > test:PASS:getsockname(cli_fd) 0 nsec
> > test:PASS:accept(listen_fd) 0 nsec
> > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd) 0 nsec
> > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_fd) 0 nsec
> > test:PASS:send(accept_fd) 0 nsec
> > test:PASS:recv(cli_fd) 0 nsec
> > test:PASS:send(accept_fd) 0 nsec
> > test:PASS:recv(cli_fd) 0 nsec
> > test:PASS:recv(accept_fd) for fin 0 nsec
> > test:PASS:recv(cli_fd) for fin 0 nsec
> > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, &accept_fd) 0 nsec
> > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cnt, &cli_fd) 0 nsec
> > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
> > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 nsec
> > check_result:PASS:bpf_map_lookup_elem(linum_map_fd, READ_SK_DST_PORT_IDX) 0 nsec
> > check_result:FAIL:failure in read_sk_dst_port on line unexpected failure in read_sk_dst_port on line: actual 297 != expected 0
> > listen_sk: state:10 bound_dev_if:0 family:10 type:1 protocol:6 mark:0
> > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) dst_port:0
> > srv_sk: state:9 bound_dev_if:0 family:10 type:1 protocol:6 mark:0
> > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > src_port:51966 dst_ip4:7f000006(127.0.0.6) dst_ip6:0:0:0:1(::1)
> > dst_port:38030
> > cli_sk: state:5 bound_dev_if:0 family:10 type:1 protocol:6 mark:0
> > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
> > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1) dst_port:51966
> > listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
> > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 mss_cache:536
> > ecn_flags:0 rate_delivered:0 rate_interval_us:0 packets_out:0
> > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 segs_out:0
> > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0 bytes_acked:0
> > srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272 snd_ssthresh:2147483647
> > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810 mss_cache:32768
> > ecn_flags:0 rate_delivered:1 rate_interval_us:272 packets_out:0
> > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 segs_out:3
> > data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1 bytes_acked:22
> > cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730 snd_ssthresh:2147483647
> > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715 mss_cache:32768
> > ecn_flags:0 rate_delivered:1 rate_interval_us:925 packets_out:0
> > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 segs_out:6
> > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23 bytes_acked:2
> > check_result:PASS:listen_sk 0 nsec
> > check_result:PASS:srv_sk 0 nsec
> > check_result:PASS:srv_tp 0 nsec
> > 
> > If the bpf program is compiled with llvm16, the test passes according
> > to a CI run.
> > 
> > I don't have an s390x environment to debug this. Could you help
> > debug it?
> > 
> > Thanks!
> > 
> > [1] https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645
> 
> Hi,
> 
> thanks for letting me know.
> I will look into this.
> 
> Best regards,
> Ilya

In the meantime, the issue was fixed by:

commit 141be5c062ecf22bd287afffd310e8ac4711444a
Author: Shoaib Meenai <smeenai@xxxxxx>
Date:   Fri May 5 14:18:12 2023 -0700

    Revert "Reland [Pipeline] Don't limit ArgumentPromotion to -O3"

    This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.

    https://reviews.llvm.org/D149768 is causing size regressions for
    -Oz with FullLTO, and I'm reverting that one while investigating.
    This commit depends on that one, so it needs to be reverted as well.

But looking at the codegen differences:

$ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g fail.s)

-pass.o: file format elf64-bpf
+fail.o: file format elf64-bpf

-00000000000002c8 <sk_dst_port__load_half>
-       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
+00000000000002c0 <sk_dst_port__load_half>
+       54 10 00 00 00 00 ff ff w1 &= 65535
        b4 00 00 00 00 00 00 01 w0 = 1
        16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 <LBB6_2>
        b4 00 00 00 00 00 00 00 w0 = 0

This is what ArgumentPromotion is supposed to do, so that's okay so far.
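As an illustration of the pass, here is a minimal user-space C sketch
with hypothetical names (the real selftest helper takes a
struct bpf_sock * instead):

/* Sketch of ArgumentPromotion: the by-reference parameter becomes a
 * by-value one, and the load moves from the callee into the caller. */
#include <stdint.h>
#include <stdio.h>

struct conn { uint32_t dst_ip4; uint16_t dst_port; };

/* Before: the callee takes a pointer and does the 16-bit load itself,
 * corresponding to the old "r1 = *(u16 *)(r1 + 48)". */
static int load_half_before(const struct conn *c)
{
        return c->dst_port == 0xcafe;
}

/* After: the argument is promoted to a value; the caller now performs
 * the load, and the callee only masks and compares, corresponding to
 * the new "w1 &= 65535". */
static int load_half_after(uint32_t dst_port)
{
        return (uint16_t)dst_port == 0xcafe;
}

int main(void)
{
        struct conn c = { .dst_ip4 = 0, .dst_port = 0xcafe };

        printf("%d %d\n", load_half_before(&c), load_half_after(c.dst_port));
        return 0;
}

Both variants print 1; the transformation is behavior-preserving as
long as the caller's load and the callee's load agree on what the
memory contains.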
However, further down below we have:

Disassembly of section cgroup_skb/egress:

-       bf 16 00 00 00 00 00 00 r1 = r6
+       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
+       bc 17 00 00 00 00 00 00 w1 = w7
        85 01 00 00 00 00 00 53 call sk_dst_port__load_word
...
-       bf 16 00 00 00 00 00 00 r1 = r6
+       74 70 00 00 00 00 00 10 w7 >>= 16
+       bc 17 00 00 00 00 00 00 w1 = w7
        85 01 00 00 00 00 00 57 call sk_dst_port__load_half

so there is no 16-bit load anymore; instead, the result of the earlier
32-bit load is reused. However, on BE these kinds of loads from this
particular field are not consistent at the moment - see [1] and the
previous discussions. De facto we get the following results:

- int load: 0x0000cafe
- short load: 0xcafe

On a consistent BE we should rather have had:

- int load: 0x0000cafe
- short load: 0

Clang, of course, expects a consistent BE and optimizes based on that
assumption (hence the "w7 >>= 16" above, which extracts the upper half
of the 32-bit load). This was a conscious tradeoff that Jakub and I
agreed on in order to keep the quirky behavior from the past. Given
what's happening with Clang now, I wonder whether it would be worth
revisiting it in the name of consistency?
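To spell out why the two de-facto results cannot both come from one
consistent byte layout, here is a small plain-C sketch (user space,
not BPF; the byte image is assumed for illustration only):

/* Given a big-endian byte image of the 32-bit value 0x0000cafe, the
 * first two bytes at that offset are zero, so a 16-bit load from the
 * same offset must yield 0, not 0xcafe. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        /* Big-endian byte image of the 32-bit value 0x0000cafe. */
        uint8_t bytes[4] = { 0x00, 0x00, 0xca, 0xfe };
        uint32_t word;
        uint16_t half;

        memcpy(&word, bytes, sizeof(word)); /* 0x0000cafe on a BE host */
        memcpy(&half, bytes, sizeof(half)); /* 0x0000 on any host */

        printf("word=0x%08x half=0x%04x\n", word, half);
        return 0;
}

On a BE host this prints word=0x0000cafe half=0x0000, i.e. a
consistent short load would have to return 0, which is exactly what
Clang's reuse of the 32-bit load assumes.

[1] https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@xxxxxxxxxxxxxx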