On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote: > > > On 5/13/23 1:24 AM, Ilya Leoshkevich wrote: > > On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote: > > > > > > > > > On 5/12/23 7:40 AM, Ilya Leoshkevich wrote: > > > > On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote: > > > > > On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote: > > > > > > Hi, Ilya, > > > > > > > > > > > > BPF CI ([1]) detected a s390x failure when bpf program is > > > > > > compiled > > > > > > with > > > > > > latest llvm17 on bpf-next branch. To reproduce the issue, > > > > > > just > > > > > > run > > > > > > normal './test_progs -j'. The failure log looks like below: > > > > > > > > > > > > Notice: Success: 341/3015, Skipped: 29, Failed: 1 > > > > > > Error: #191 sock_fields > > > > > > Error: #191 sock_fields > > > > > > create_netns:PASS:create netns 0 nsec > > > > > > create_netns:PASS:bring up lo 0 nsec > > > > > > > > > > > > serial_test_sock_fields:PASS:test_sock_fields__open_and_loa > > > > > > d 0 > > > > > > nsec > > > > > > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock > > > > > > _fie > > > > > > lds) > > > > > > 0 > > > > > > nsec > > > > > > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(ingress_read_soc > > > > > > k_fi > > > > > > elds > > > > > > ) > > > > > > 0 nsec > > > > > > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port > > > > > > 0 > > > > > > nsec > > > > > > test:PASS:getsockname(listen_fd) 0 nsec > > > > > > test:PASS:getsockname(cli_fd) 0 nsec > > > > > > test:PASS:accept(listen_fd) 0 nsec > > > > > > > > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd) > > > > > > 0 > > > > > > nsec > > > > > > > > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_f > > > > > > d) 0 > > > > > > nsec > > > > > > test:PASS:send(accept_fd) 0 nsec > > > > > > test:PASS:recv(cli_fd) 0 nsec > > > > > > test:PASS:send(accept_fd) 0 nsec > > > > > > test:PASS:recv(cli_fd) 0 nsec > > > > > > test:PASS:recv(accept_fd) for fin 0 nsec > > > > > > test:PASS:recv(cli_fd) for fin 0 nsec > > > > > > > > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn > > > > > > t, > > > > > > &accept_fd) 0 nsec > > > > > > > > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn > > > > > > t, > > > > > > &cli_fd) 0 nsec > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 > > > > > > nsec > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 > > > > > > nsec > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd, > > > > > > READ_SK_DST_PORT_IDX) 0 nsec > > > > > > check_result:FAIL:failure in read_sk_dst_port on line > > > > > > unexpected > > > > > > failure in read_sk_dst_port on line: actual 297 != expected > > > > > > 0 > > > > > > listen_sk: state:10 bound_dev_if:0 family:10 type:1 > > > > > > protocol:6 > > > > > > mark:0 > > > > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) > > > > > > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) > > > > > > dst_port:0 > > > > > > srv_sk: state:9 bound_dev_if:0 family:10 type:1 > > > > > > protocol:6 > > > > > > mark:0 > > > > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) > > > > > > src_port:51966 dst_ip4:7f000006(127.0.0.6) > > > > > > dst_ip6:0:0:0:1(::1) > > > > > > dst_port:38030 > > > > > > cli_sk: state:5 bound_dev_if:0 family:10 type:1 > > > > > > protocol:6 > > > > > > mark:0 > > > > > > priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1) > > > > > > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1) > > > > > > dst_port:51966 > > > > > > listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295 > > > > > > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 > > > > > > mss_cache:536 > > > > > > ecn_flags:0 rate_delivered:0 rate_interval_us:0 > > > > > > packets_out:0 > > > > > > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 > > > > > > segs_out:0 > > > > > > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0 > > > > > > bytes_acked:0 > > > > > > srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272 > > > > > > snd_ssthresh:2147483647 > > > > > > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810 > > > > > > mss_cache:32768 > > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:272 > > > > > > packets_out:0 > > > > > > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 > > > > > > segs_out:3 > > > > > > data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1 > > > > > > bytes_acked:22 > > > > > > cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730 > > > > > > snd_ssthresh:2147483647 > > > > > > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715 > > > > > > mss_cache:32768 > > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:925 > > > > > > packets_out:0 > > > > > > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 > > > > > > segs_out:6 > > > > > > data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23 > > > > > > bytes_acked:2 > > > > > > check_result:PASS:listen_sk 0 nsec > > > > > > check_result:PASS:srv_sk 0 nsec > > > > > > check_result:PASS:srv_tp 0 nsec > > > > > > > > > > > > If bpf program is compiled with llvm16, the test passed > > > > > > according > > > > > > to > > > > > > a CI run. > > > > > > > > > > > > I don't have s390x environment to debug this. Could you > > > > > > help > > > > > > debug > > > > > > it? > > > > > > > > > > > > Thanks! > > > > > > > > > > > > [1] > > > > > > https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645 > > > > > > > > > > > > > > > Hi, > > > > > > > > > > thank for letting me know. > > > > > I will look into this. > > > > > > > > > > Best regards, > > > > > Ilya > > > > > > > > In the meantime the issue was fixed by: > > > > > > > > commit 141be5c062ecf22bd287afffd310e8ac4711444a > > > > Author: Shoaib Meenai <smeenai@xxxxxx> > > > > Date: Fri May 5 14:18:12 2023 -0700 > > > > > > > > Revert "Reland [Pipeline] Don't limit ArgumentPromotion > > > > to - > > > > O3" > > > > > > > > This reverts commit > > > > 6f29d1adf29820daae9ea7a01ae2588b67735b9e. > > > > > > > > https://reviews.llvm.org/D149768 ; is causing size > > > > regressions > > > > for - > > > > Oz > > > > with FullLTO, and I'm reverting that one while > > > > investigating. > > > > This > > > > commit depends on that one, so it needs to be reverted as > > > > well. > > > > > > The transformtion "Don't limit ArgumentPromotion to -O3" is > > > temporarily > > > reverted. But it could be reverted again once the issue is > > > resolved. > > > So it is a good idea to resolve the issue in the kernel. > > > > > > > > > > > But looking at the codegen differences: > > > > > > > > $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g > > > > fail.s) > > > > > > > > -pass.o: file format elf64-bpf > > > > +fail.o: file format elf64-bpf > > > > > > > > -00000000000002c8 <sk_dst_port__load_half> > > > > - 69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48) > > > > +00000000000002c0 <sk_dst_port__load_half> > > > > + 54 10 00 00 00 00 ff ff w1 &= 65535 > > > > b4 00 00 00 00 00 00 01 w0 = 1 > > > > 16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 > > > > <LBB6_2> > > > > b4 00 00 00 00 00 00 00 w0 = 0 > > > > > > > > This is what ArgumentPromotion is supposed to do, so that's > > > > okay so > > > > far. However, further down below we have: > > > > > > > > Disassembly of section cgroup_skb/egress: > > > > > > > > - bf 16 00 00 00 00 00 00 r1 = r6 > > > > + 61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48) > > > > + bc 17 00 00 00 00 00 00 w1 = w7 > > > > 85 01 00 00 00 00 00 53 call sk_dst_port__load_word > > > > > > > > ... > > > > > > > > - bf 16 00 00 00 00 00 00 r1 = r6 > > > > + 74 70 00 00 00 00 00 10 w7 >>= 16 > > > > + bc 17 00 00 00 00 00 00 w1 = w7 > > > > 85 01 00 00 00 00 00 57 call sk_dst_port__load_half > > > > > > > > so there is no 16-bit load anymore, instead, the result from > > > > the > > > > earlier 32-bit load is reused. However, on BE these kinds of > > > > loads > > > > for this particular field are not consistent at the moment - > > > > see > > > > [1] > > > > and the previous discussions. > > > > > > > > De-facto we have the following results: > > > > > > > > - int load: 0x0000cafe > > > > - short load: 0xcafe > > > > > > So 'De-facto' means the above is the expected result. > > > > > > > > > > > On a consistent BE we should have rather had: > > > > > > > > - int load: 0x0000cafe > > > > - short load: 0 > > > > > > What does 'consistent BE' mean here? Does it mean the expected > > > result from the source code itself? > > > > I should not have called the de-facto example "BE" at all: it's > > rather > > "mixed endianness" or "weird endianness" or something along these > > lines. > > > > On "consistent BE" or simply "BE" properties like > > > > *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2); > > > > hold. This is currently not the case for bpf_sock.dst_port. > > > > We compile with -mbig-endian, so we promise to the compiler that > > the > > machine is big-endian, and the compiler expects the above to hold > > for > > any p. Unfortunately when p points to bpf_sock.dst_port, this is > > not > > the case. > > If I understand correctly, *(uint32_t *)p to get the > bpf_sock.dst_port > is for backward compatibility. But the real u32 read by compiler will > do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the > same as expected *(uint32_t *)p so we have problem here. > > > > > The property above is important for the correctness of the > > load/store > > tearing transformations. What we have here is not exactly tearing, > > but > > is quite close. > > > > > > Clang, of course, expects a consistent BE and optimizes around > > > > that. > > > > > > > > This was a conscious tradeoff Jakub and I have agreed on in > > > > order > > > > to > > > > keep the quirky behavior from the past. Given what's happening > > > > with > > > > Clang now, I wonder if it would be worth revisiting it in the > > > > name > > > > of > > > > consistency? > > > > > > If I understand correctly, I think the issue is > > > r7 = *(u32 *)(r6 + 48) > > > w7 >= 16 > > > w1 = w7 > > > > > > after verifier, it is changed to > > > r7 = *(u16 *)(r6 + <kernel offset>) > > > w7 >= 16 > > > w1 = w7 > > > > > > and the result after verifier rewrite is completely wrong. > > > Is it right? > > > > No, the verifier rewrite is correct. > > The sk_dst_port__load_word() part of the test succeeds. > > > > All these rewrites already work fine, they are correct and > > consistent. > > It's really just bpf_sock.dst_port that is special. > > > > > If this is the case, during verifier rewrite, if it is > > > big endian, say the user intends to load 4 bytes (from uapi > > > header) > > > while the kernel field is 2 bytes, in such cases, kernel > > > can still pretend to generate 4-byte load. For example, > > > for the above example, the code after verification could be: > > > r7 = *(u16 *)(r6 + <kernel offset>) > > > r7 <= 16 > > > w7 >= 16 > > > w1 = w7 > > > > > > Hopefully, we won't have many such cases. Does this work? > > > > This would break the sk_dst_port__load_word() part of the test. > > This is a hack. This may work for this specific u16 case, but > yes, it won't work for u32 load case. > > > > > > > > > Above I asked whether we can resolve the inconsistency, but I > > thought > > about it and I don't see a way of doing it without breaking the > > ABI, > > which is at worst unacceptable, and at best a last resort measure. > > > > What do you think about marking bpf_sock.dst_port volatile? > > volatile > > should prevent tearing and similar optimizations, with which we > > have a > > problem here. > > > > We could also add a comment warning users not to cast away this > > volatile due to the quirk we have. And then we should adjust the > > test > > (making all casts volatile) to comply with this new warning. > > I did a little study on this. The main problem here for > static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk) > { > __u16 *half = (__u16 *)&sk->dst_port; > return half[0] == bpf_htons(0xcafe); > } > > Through some cross-function optimization by ArgumentPromotion > optimization, the compiler does: > /* the below shared by sk_dst_port__load_word > * and sk_dst_port__load_half > */ > __u32 *word = (__u32 *)&sk->dst_port; > __u32 word_val = word[0]; > > /* the below is for sk_dst_port__load_half only */ > __u16 half_val = word_val >> 16; > > ... half_val passed into sk_dst_port__load_half ... > return half_val == bpf_htons(0xcafe); > > Here, 'word_val = word[0]' is replaced by 2-byte load > by the verifier and this caused the trouble for later > sk_dst_port__load_half(). > > I don't have a good solution here. The issue is exposed > as we have both u16 and u32 load for &sk->dst_port and > the compiler did some optimization with this. > > I would say this is an extreme corner case and we can just > fix in the source code like below: > > diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c > b/tools/testing/selftests/bpf/progs/test_sock_fields.c > index bbad3c2d9aa5..39c975786720 100644 > --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c > +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c > @@ -265,7 +265,10 @@ static __noinline bool > sk_dst_port__load_word(struct bpf_sock *sk) > > static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk) > { > - __u16 *half = (__u16 *)&sk->dst_port; > + __u16 *half; > + > + asm volatile (""); > + half = (__u16 *)&sk->dst_port; > return half[0] == bpf_htons(0xcafe); > } > > Could you try whether the above workaround works or not? > If we want the code to be future proof for potential > cross-func optimization for these noinline functions, we > can add similar asm codes to all of > bool sk_dst_port__load_{word, half, byte}. Hi, this makes the issue go away, thanks. However, I'm still concerned, because this only inhibits a certain optimization and does not address the underlying fundamental problem: we promise to clang that the in-kernel implementation of the eBPF virtual machine is big-endian, while in reality it's not. As compiler optimizations get more aggressive, we will surely see more of this. Why not do this instead? diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1bb11a6ee667..3c9b535532ae 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -6102,7 +6102,7 @@ struct bpf_sock { __u32 src_ip4; __u32 src_ip6[4]; __u32 src_port; /* host byte order */ - __be16 dst_port; /* network byte order */ + volatile __be16 dst_port; /* network byte order */ __u16 :16; /* zero padding */ __u32 dst_ip4; __u32 dst_ip6[4]; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 1bb11a6ee667..3c9b535532ae 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -6102,7 +6102,7 @@ struct bpf_sock { __u32 src_ip4; __u32 src_ip6[4]; __u32 src_port; /* host byte order */ - __be16 dst_port; /* network byte order */ + volatile __be16 dst_port; /* network byte order */ __u16 :16; /* zero padding */ __u32 dst_ip4; __u32 dst_ip6[4]; diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c b/tools/testing/selftests/bpf/progs/test_sock_fields.c index bbad3c2d9aa5..773ded84ac12 100644 --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c @@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff *skb) */ static __noinline bool sk_dst_port__load_word(struct bpf_sock *sk) { - __u32 *word = (__u32 *)&sk->dst_port; + volatile __u32 *word = (volatile __u32 *)&sk->dst_port; return word[0] == bpf_htons(0xcafe); } static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk) { - __u16 *half = (__u16 *)&sk->dst_port; + volatile __u16 *half = (volatile __u16 *)&sk->dst_port; return half[0] == bpf_htons(0xcafe); } static __noinline bool sk_dst_port__load_byte(struct bpf_sock *sk) { - __u8 *byte = (__u8 *)&sk->dst_port; + volatile __u8 *byte = (volatile __u8 *)&sk->dst_port; return byte[0] == 0xca && byte[1] == 0xfe; } This also works, and as far as I'm concerned, this would be a proper fix for the underlying issue: we tell the compiler that it should never ever (with any of today's or future optimizations) try to be clever when accessing dst_port. Best regards, Ilya > > > > [1] > > > > https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@xxxxxxxxxxxxxx