On Mon, 2023-05-15 at 08:27 -0700, Yonghong Song wrote: > > > On 5/15/23 12:55 AM, Ilya Leoshkevich wrote: > > On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote: > > > > > > > > > On 5/13/23 1:24 AM, Ilya Leoshkevich wrote: > > > > On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote: > > > > > > > > > > > > > > > On 5/12/23 7:40 AM, Ilya Leoshkevich wrote: > > > > > > On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote: > > > > > > > On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote: > > > > > > > > Hi, Ilya, > > > > > > > > > > > > > > > > BPF CI ([1]) detected a s390x failure when bpf program > > > > > > > > is > > > > > > > > compiled > > > > > > > > with > > > > > > > > latest llvm17 on bpf-next branch. To reproduce the > > > > > > > > issue, > > > > > > > > just > > > > > > > > run > > > > > > > > normal './test_progs -j'. The failure log looks like > > > > > > > > below: > > > > > > > > > > > > > > > > Notice: Success: 341/3015, Skipped: 29, Failed: 1 > > > > > > > > Error: #191 sock_fields > > > > > > > > Error: #191 sock_fields > > > > > > > > create_netns:PASS:create netns 0 nsec > > > > > > > > create_netns:PASS:bring up lo 0 nsec > > > > > > > > > > > > > > > > serial_test_sock_fields:PASS:test_sock_fields__open_and > > > > > > > > _loa > > > > > > > > d 0 > > > > > > > > nsec > > > > > > > > > > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(egress_read_ > > > > > > > > sock > > > > > > > > _fie > > > > > > > > lds) > > > > > > > > 0 > > > > > > > > nsec > > > > > > > > > > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(ingress_read > > > > > > > > _soc > > > > > > > > k_fi > > > > > > > > elds > > > > > > > > ) > > > > > > > > 0 nsec > > > > > > > > > > > > > > > > serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_ > > > > > > > > port > > > > > > > > 0 > > > > > > > > nsec > > > > > > > > test:PASS:getsockname(listen_fd) 0 nsec > > > > > > > > test:PASS:getsockname(cli_fd) 0 nsec > > > > > > > > test:PASS:accept(listen_fd) 0 nsec > > > > > > > > > > > > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt > > > > > > > > _fd) > > > > > > > > 0 > > > > > > > > nsec > > > > > > > > > > > > > > > > init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt > > > > > > > > 10_f > > > > > > > > d) 0 > > > > > > > > nsec > > > > > > > > test:PASS:send(accept_fd) 0 nsec > > > > > > > > test:PASS:recv(cli_fd) 0 nsec > > > > > > > > test:PASS:send(accept_fd) 0 nsec > > > > > > > > test:PASS:recv(cli_fd) 0 nsec > > > > > > > > test:PASS:recv(accept_fd) for fin 0 nsec > > > > > > > > test:PASS:recv(cli_fd) for fin 0 nsec > > > > > > > > > > > > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_ou > > > > > > > > t_cn > > > > > > > > t, > > > > > > > > &accept_fd) 0 nsec > > > > > > > > > > > > > > > > check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_ou > > > > > > > > t_cn > > > > > > > > t, > > > > > > > > &cli_fd) 0 nsec > > > > > > > > > > > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 > > > > > > > > nsec > > > > > > > > > > > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0 > > > > > > > > nsec > > > > > > > > > > > > > > > > check_result:PASS:bpf_map_lookup_elem(linum_map_fd, > > > > > > > > READ_SK_DST_PORT_IDX) 0 nsec > > > > > > > > check_result:FAIL:failure in read_sk_dst_port on > > > > > > > > line > > > > > > > > unexpected > > > > > > > > failure in read_sk_dst_port on line: actual 297 != > > > > > > > > expected > > > > > > > > 0 > > > > > > > > listen_sk: state:10 bound_dev_if:0 family:10 > > > > > > > > type:1 > > > > > > > > protocol:6 > > > > > > > > mark:0 > > > > > > > > priority:0 src_ip4:7f000006(127.0.0.6) > > > > > > > > src_ip6:0:0:0:1(::1) > > > > > > > > src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::) > > > > > > > > dst_port:0 > > > > > > > > srv_sk: state:9 bound_dev_if:0 family:10 type:1 > > > > > > > > protocol:6 > > > > > > > > mark:0 > > > > > > > > priority:0 src_ip4:7f000006(127.0.0.6) > > > > > > > > src_ip6:0:0:0:1(::1) > > > > > > > > src_port:51966 dst_ip4:7f000006(127.0.0.6) > > > > > > > > dst_ip6:0:0:0:1(::1) > > > > > > > > dst_port:38030 > > > > > > > > cli_sk: state:5 bound_dev_if:0 family:10 type:1 > > > > > > > > protocol:6 > > > > > > > > mark:0 > > > > > > > > priority:0 src_ip4:7f000006(127.0.0.6) > > > > > > > > src_ip6:0:0:0:1(::1) > > > > > > > > src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1) > > > > > > > > dst_port:51966 > > > > > > > > listen_tp: snd_cwnd:10 srtt_us:0 > > > > > > > > rtt_min:4294967295 > > > > > > > > snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0 > > > > > > > > mss_cache:536 > > > > > > > > ecn_flags:0 rate_delivered:0 rate_interval_us:0 > > > > > > > > packets_out:0 > > > > > > > > retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0 > > > > > > > > segs_out:0 > > > > > > > > data_segs_out:0 lost_out:0 sacked_out:0 > > > > > > > > bytes_received:0 > > > > > > > > bytes_acked:0 > > > > > > > > srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272 > > > > > > > > snd_ssthresh:2147483647 > > > > > > > > rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810 > > > > > > > > mss_cache:32768 > > > > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:272 > > > > > > > > packets_out:0 > > > > > > > > retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0 > > > > > > > > segs_out:3 > > > > > > > > data_segs_out:2 lost_out:0 sacked_out:0 > > > > > > > > bytes_received:1 > > > > > > > > bytes_acked:22 > > > > > > > > cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730 > > > > > > > > snd_ssthresh:2147483647 > > > > > > > > rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715 > > > > > > > > mss_cache:32768 > > > > > > > > ecn_flags:0 rate_delivered:1 rate_interval_us:925 > > > > > > > > packets_out:0 > > > > > > > > retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2 > > > > > > > > segs_out:6 > > > > > > > > data_segs_out:0 lost_out:0 sacked_out:0 > > > > > > > > bytes_received:23 > > > > > > > > bytes_acked:2 > > > > > > > > check_result:PASS:listen_sk 0 nsec > > > > > > > > check_result:PASS:srv_sk 0 nsec > > > > > > > > check_result:PASS:srv_tp 0 nsec > > > > > > > > > > > > > > > > If bpf program is compiled with llvm16, the test passed > > > > > > > > according > > > > > > > > to > > > > > > > > a CI run. > > > > > > > > > > > > > > > > I don't have s390x environment to debug this. Could you > > > > > > > > help > > > > > > > > debug > > > > > > > > it? > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > [1] > > > > > > > > https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645 > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > thank for letting me know. > > > > > > > I will look into this. > > > > > > > > > > > > > > Best regards, > > > > > > > Ilya > > > > > > > > > > > > In the meantime the issue was fixed by: > > > > > > > > > > > > commit 141be5c062ecf22bd287afffd310e8ac4711444a > > > > > > Author: Shoaib Meenai <smeenai@xxxxxx> > > > > > > Date: Fri May 5 14:18:12 2023 -0700 > > > > > > > > > > > > Revert "Reland [Pipeline] Don't limit > > > > > > ArgumentPromotion > > > > > > to - > > > > > > O3" > > > > > > > > > > > > This reverts commit > > > > > > 6f29d1adf29820daae9ea7a01ae2588b67735b9e. > > > > > > > > > > > > https://reviews.llvm.org/D149768 ; is causing size > > > > > > regressions > > > > > > for - > > > > > > Oz > > > > > > with FullLTO, and I'm reverting that one while > > > > > > investigating. > > > > > > This > > > > > > commit depends on that one, so it needs to be > > > > > > reverted as > > > > > > well. > > > > > > > > > > The transformtion "Don't limit ArgumentPromotion to -O3" is > > > > > temporarily > > > > > reverted. But it could be reverted again once the issue is > > > > > resolved. > > > > > So it is a good idea to resolve the issue in the kernel. > > > > > > > > > > > > > > > > > But looking at the codegen differences: > > > > > > > > > > > > $ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0- > > > > > > 9]*://g > > > > > > fail.s) > > > > > > > > > > > > -pass.o: file format elf64-bpf > > > > > > +fail.o: file format elf64-bpf > > > > > > > > > > > > -00000000000002c8 <sk_dst_port__load_half> > > > > > > - 69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48) > > > > > > +00000000000002c0 <sk_dst_port__load_half> > > > > > > + 54 10 00 00 00 00 ff ff w1 &= 65535 > > > > > > b4 00 00 00 00 00 00 01 w0 = 1 > > > > > > 16 10 00 01 00 00 ca fe if w1 == 51966 goto +1 > > > > > > <LBB6_2> > > > > > > b4 00 00 00 00 00 00 00 w0 = 0 > > > > > > > > > > > > This is what ArgumentPromotion is supposed to do, so that's > > > > > > okay so > > > > > > far. However, further down below we have: > > > > > > > > > > > > Disassembly of section cgroup_skb/egress: > > > > > > > > > > > > - bf 16 00 00 00 00 00 00 r1 = r6 > > > > > > + 61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48) > > > > > > + bc 17 00 00 00 00 00 00 w1 = w7 > > > > > > 85 01 00 00 00 00 00 53 call > > > > > > sk_dst_port__load_word > > > > > > > > > > > > ... > > > > > > > > > > > > - bf 16 00 00 00 00 00 00 r1 = r6 > > > > > > + 74 70 00 00 00 00 00 10 w7 >>= 16 > > > > > > + bc 17 00 00 00 00 00 00 w1 = w7 > > > > > > 85 01 00 00 00 00 00 57 call > > > > > > sk_dst_port__load_half > > > > > > > > > > > > so there is no 16-bit load anymore, instead, the result > > > > > > from > > > > > > the > > > > > > earlier 32-bit load is reused. However, on BE these kinds > > > > > > of > > > > > > loads > > > > > > for this particular field are not consistent at the moment > > > > > > - > > > > > > see > > > > > > [1] > > > > > > and the previous discussions. > > > > > > > > > > > > De-facto we have the following results: > > > > > > > > > > > > - int load: 0x0000cafe > > > > > > - short load: 0xcafe > > > > > > > > > > So 'De-facto' means the above is the expected result. > > > > > > > > > > > > > > > > > On a consistent BE we should have rather had: > > > > > > > > > > > > - int load: 0x0000cafe > > > > > > - short load: 0 > > > > > > > > > > What does 'consistent BE' mean here? Does it mean the > > > > > expected > > > > > result from the source code itself? > > > > > > > > I should not have called the de-facto example "BE" at all: it's > > > > rather > > > > "mixed endianness" or "weird endianness" or something along > > > > these > > > > lines. > > > > > > > > On "consistent BE" or simply "BE" properties like > > > > > > > > *(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2); > > > > > > > > hold. This is currently not the case for bpf_sock.dst_port. > > > > > > > > We compile with -mbig-endian, so we promise to the compiler > > > > that > > > > the > > > > machine is big-endian, and the compiler expects the above to > > > > hold > > > > for > > > > any p. Unfortunately when p points to bpf_sock.dst_port, this > > > > is > > > > not > > > > the case. > > > > > > If I understand correctly, *(uint32_t *)p to get the > > > bpf_sock.dst_port > > > is for backward compatibility. But the real u32 read by compiler > > > will > > > do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the > > > same as expected *(uint32_t *)p so we have problem here. > > > > > > > > > > > The property above is important for the correctness of the > > > > load/store > > > > tearing transformations. What we have here is not exactly > > > > tearing, > > > > but > > > > is quite close. > > > > > > > > > > Clang, of course, expects a consistent BE and optimizes > > > > > > around > > > > > > that. > > > > > > > > > > > > This was a conscious tradeoff Jakub and I have agreed on in > > > > > > order > > > > > > to > > > > > > keep the quirky behavior from the past. Given what's > > > > > > happening > > > > > > with > > > > > > Clang now, I wonder if it would be worth revisiting it in > > > > > > the > > > > > > name > > > > > > of > > > > > > consistency? > > > > > > > > > > If I understand correctly, I think the issue is > > > > > r7 = *(u32 *)(r6 + 48) > > > > > w7 >= 16 > > > > > w1 = w7 > > > > > > > > > > after verifier, it is changed to > > > > > r7 = *(u16 *)(r6 + <kernel offset>) > > > > > w7 >= 16 > > > > > w1 = w7 > > > > > > > > > > and the result after verifier rewrite is completely wrong. > > > > > Is it right? > > > > > > > > No, the verifier rewrite is correct. > > > > The sk_dst_port__load_word() part of the test succeeds. > > > > > > > > All these rewrites already work fine, they are correct and > > > > consistent. > > > > It's really just bpf_sock.dst_port that is special. > > > > > > > > > If this is the case, during verifier rewrite, if it is > > > > > big endian, say the user intends to load 4 bytes (from uapi > > > > > header) > > > > > while the kernel field is 2 bytes, in such cases, kernel > > > > > can still pretend to generate 4-byte load. For example, > > > > > for the above example, the code after verification could be: > > > > > r7 = *(u16 *)(r6 + <kernel offset>) > > > > > r7 <= 16 > > > > > w7 >= 16 > > > > > w1 = w7 > > > > > > > > > > Hopefully, we won't have many such cases. Does this work? > > > > > > > > This would break the sk_dst_port__load_word() part of the test. > > > > > > This is a hack. This may work for this specific u16 case, but > > > yes, it won't work for u32 load case. > > > > > > > > > > > > > > > > > > > Above I asked whether we can resolve the inconsistency, but I > > > > thought > > > > about it and I don't see a way of doing it without breaking the > > > > ABI, > > > > which is at worst unacceptable, and at best a last resort > > > > measure. > > > > > > > > What do you think about marking bpf_sock.dst_port volatile? > > > > volatile > > > > should prevent tearing and similar optimizations, with which we > > > > have a > > > > problem here. > > > > > > > > We could also add a comment warning users not to cast away this > > > > volatile due to the quirk we have. And then we should adjust > > > > the > > > > test > > > > (making all casts volatile) to comply with this new warning. > > > > > > I did a little study on this. The main problem here for > > > static __noinline bool sk_dst_port__load_half(struct bpf_sock > > > *sk) > > > { > > > __u16 *half = (__u16 *)&sk->dst_port; > > > return half[0] == bpf_htons(0xcafe); > > > } > > > > > > Through some cross-function optimization by ArgumentPromotion > > > optimization, the compiler does: > > > /* the below shared by sk_dst_port__load_word > > > * and sk_dst_port__load_half > > > */ > > > __u32 *word = (__u32 *)&sk->dst_port; > > > __u32 word_val = word[0]; > > > > > > /* the below is for sk_dst_port__load_half only */ > > > __u16 half_val = word_val >> 16; > > > > > > ... half_val passed into sk_dst_port__load_half ... > > > return half_val == bpf_htons(0xcafe); > > > > > > Here, 'word_val = word[0]' is replaced by 2-byte load > > > by the verifier and this caused the trouble for later > > > sk_dst_port__load_half(). > > > > > > I don't have a good solution here. The issue is exposed > > > as we have both u16 and u32 load for &sk->dst_port and > > > the compiler did some optimization with this. > > > > > > I would say this is an extreme corner case and we can just > > > fix in the source code like below: > > > > > > diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c > > > b/tools/testing/selftests/bpf/progs/test_sock_fields.c > > > index bbad3c2d9aa5..39c975786720 100644 > > > --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c > > > +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c > > > @@ -265,7 +265,10 @@ static __noinline bool > > > sk_dst_port__load_word(struct bpf_sock *sk) > > > > > > static __noinline bool sk_dst_port__load_half(struct bpf_sock > > > *sk) > > > { > > > - __u16 *half = (__u16 *)&sk->dst_port; > > > + __u16 *half; > > > + > > > + asm volatile (""); > > > + half = (__u16 *)&sk->dst_port; > > > return half[0] == bpf_htons(0xcafe); > > > } > > > > > > Could you try whether the above workaround works or not? > > > If we want the code to be future proof for potential > > > cross-func optimization for these noinline functions, we > > > can add similar asm codes to all of > > > bool sk_dst_port__load_{word, half, byte}. > > > > Hi, > > > > this makes the issue go away, thanks. > > > > However, I'm still concerned, because this only inhibits a certain > > optimization and does not address the underlying fundamental > > problem: > > we promise to clang that the in-kernel implementation of the eBPF > > virtual machine is big-endian, while in reality it's not. As > > compiler > > optimizations get more aggressive, we will surely see more of this. > > > > Why not do this instead? > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 1bb11a6ee667..3c9b535532ae 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -6102,7 +6102,7 @@ struct bpf_sock { > > __u32 src_ip4; > > __u32 src_ip6[4]; > > __u32 src_port; /* host byte order */ > > - __be16 dst_port; /* network byte order */ > > + volatile __be16 dst_port; /* network byte order */ > > __u16 :16; /* zero padding */ > > __u32 dst_ip4; > > __u32 dst_ip6[4]; > > diff --git a/tools/include/uapi/linux/bpf.h > > b/tools/include/uapi/linux/bpf.h > > index 1bb11a6ee667..3c9b535532ae 100644 > > --- a/tools/include/uapi/linux/bpf.h > > +++ b/tools/include/uapi/linux/bpf.h > > @@ -6102,7 +6102,7 @@ struct bpf_sock { > > __u32 src_ip4; > > __u32 src_ip6[4]; > > __u32 src_port; /* host byte order */ > > - __be16 dst_port; /* network byte order */ > > + volatile __be16 dst_port; /* network byte order */ > > __u16 :16; /* zero padding */ > > __u32 dst_ip4; > > __u32 dst_ip6[4]; > > diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c > > b/tools/testing/selftests/bpf/progs/test_sock_fields.c > > index bbad3c2d9aa5..773ded84ac12 100644 > > --- a/tools/testing/selftests/bpf/progs/test_sock_fields.c > > +++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c > > @@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff > > *skb) > > */ > > static __noinline bool sk_dst_port__load_word(struct bpf_sock > > *sk) > > { > > - __u32 *word = (__u32 *)&sk->dst_port; > > + volatile __u32 *word = (volatile __u32 *)&sk->dst_port; > > return word[0] == bpf_htons(0xcafe); > > } > > > > static __noinline bool sk_dst_port__load_half(struct bpf_sock > > *sk) > > { > > - __u16 *half = (__u16 *)&sk->dst_port; > > + volatile __u16 *half = (volatile __u16 *)&sk->dst_port; > > return half[0] == bpf_htons(0xcafe); > > } > > > > static __noinline bool sk_dst_port__load_byte(struct bpf_sock > > *sk) > > { > > - __u8 *byte = (__u8 *)&sk->dst_port; > > + volatile __u8 *byte = (volatile __u8 *)&sk->dst_port; > > return byte[0] == 0xca && byte[1] == 0xfe; > > } > > > > This also works, and as far as I'm concerned, this would be a > > proper > > fix for the underlying issue: we tell the compiler that it should > > never > > ever (with any of today's or future optimizations) try to be clever > > when accessing dst_port. > > The above test_sock_fields.c change should work too. > I think the uapi change is not necessary. The key word 'volatile' > intends to avoid merging two or more identical loads together. > In this particular case, with only uapi changes, the > harmful transformation can still happen since the sk->dst_port > is indeed loaded only once. > > The test_sock_fields.c change itself forces proper load > which is verifier friendly. So I suggest to have > test_sock_fields.c change only. My previously suggested > changes have the same effect, to preserve the verifier friendly > load. Ok, let's do it this way then. Tested-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx> for the diff you posted above. > > > > > Best regards, > > Ilya > > > > > > > > > > [1] > > > > > > https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@xxxxxxxxxxxxxx > >