Re: selftest sock_fields failed on s390x with latest llvm17

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 5/15/23 12:55 AM, Ilya Leoshkevich wrote:
On Sun, 2023-05-14 at 09:58 -0700, Yonghong Song wrote:


On 5/13/23 1:24 AM, Ilya Leoshkevich wrote:
On Fri, 2023-05-12 at 21:13 -0700, Yonghong Song wrote:


On 5/12/23 7:40 AM, Ilya Leoshkevich wrote:
On Wed, 2023-05-03 at 21:46 +0200, Ilya Leoshkevich wrote:
On Wed, 2023-05-03 at 12:35 -0700, Yonghong Song wrote:
Hi, Ilya,

BPF CI ([1]) detected a s390x failure when bpf program is
compiled
with
latest llvm17 on bpf-next branch. To reproduce the issue,
just
run
normal './test_progs -j'. The failure log looks like below:

Notice: Success: 341/3015, Skipped: 29, Failed: 1
Error: #191 sock_fields
      Error: #191 sock_fields
      create_netns:PASS:create netns 0 nsec
      create_netns:PASS:bring up lo 0 nsec
serial_test_sock_fields:PASS:test_sock_fields__open_and_loa
d 0
nsec
serial_test_sock_fields:PASS:attach_cgroup(egress_read_sock
_fie
lds)
0
nsec
serial_test_sock_fields:PASS:attach_cgroup(ingress_read_soc
k_fi
elds
)
0 nsec
serial_test_sock_fields:PASS:attach_cgroup(read_sk_dst_port
0
nsec
      test:PASS:getsockname(listen_fd) 0 nsec
      test:PASS:getsockname(cli_fd) 0 nsec
      test:PASS:accept(listen_fd) 0 nsec
init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt_fd)
0
nsec
init_sk_storage:PASS:bpf_map_update_elem(sk_pkt_out_cnt10_f
d) 0
nsec
      test:PASS:send(accept_fd) 0 nsec
      test:PASS:recv(cli_fd) 0 nsec
      test:PASS:send(accept_fd) 0 nsec
      test:PASS:recv(cli_fd) 0 nsec
      test:PASS:recv(accept_fd) for fin 0 nsec
      test:PASS:recv(cli_fd) for fin 0 nsec
check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn
t,
&accept_fd) 0 nsec
check_sk_pkt_out_cnt:PASS:bpf_map_lookup_elem(sk_pkt_out_cn
t,
&cli_fd) 0 nsec
      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
nsec
      check_result:PASS:bpf_map_lookup_elem(linum_map_fd) 0
nsec
      check_result:PASS:bpf_map_lookup_elem(linum_map_fd,
READ_SK_DST_PORT_IDX) 0 nsec
      check_result:FAIL:failure in read_sk_dst_port on line
unexpected
failure in read_sk_dst_port on line: actual 297 != expected
0
      listen_sk: state:10 bound_dev_if:0 family:10 type:1
protocol:6
mark:0
priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
src_port:51966 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:0(::)
dst_port:0
      srv_sk: state:9 bound_dev_if:0 family:10 type:1
protocol:6
mark:0
priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
src_port:51966 dst_ip4:7f000006(127.0.0.6)
dst_ip6:0:0:0:1(::1)
dst_port:38030
      cli_sk: state:5 bound_dev_if:0 family:10 type:1
protocol:6
mark:0
priority:0 src_ip4:7f000006(127.0.0.6) src_ip6:0:0:0:1(::1)
src_port:38030 dst_ip4:0(0.0.0.0) dst_ip6:0:0:0:1(::1)
dst_port:51966
      listen_tp: snd_cwnd:10 srtt_us:0 rtt_min:4294967295
snd_ssthresh:2147483647 rcv_nxt:0 snd_nxt:0 snd:una:0
mss_cache:536
ecn_flags:0 rate_delivered:0 rate_interval_us:0
packets_out:0
retrans_out:0 total_retrans:0 segs_in:0 data_segs_in:0
segs_out:0
data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:0
bytes_acked:0
      srv_tp: snd_cwnd:10 srtt_us:3904 rtt_min:272
snd_ssthresh:2147483647
rcv_nxt:648617715 snd_nxt:4218065810 snd:una:4218065810
mss_cache:32768
ecn_flags:0 rate_delivered:1 rate_interval_us:272
packets_out:0
retrans_out:0 total_retrans:0 segs_in:5 data_segs_in:0
segs_out:3
data_segs_out:2 lost_out:0 sacked_out:0 bytes_received:1
bytes_acked:22
      cli_tp: snd_cwnd:10 srtt_us:6035 rtt_min:730
snd_ssthresh:2147483647
rcv_nxt:4218065811 snd_nxt:648617715 snd:una:648617715
mss_cache:32768
ecn_flags:0 rate_delivered:1 rate_interval_us:925
packets_out:0
retrans_out:0 total_retrans:0 segs_in:4 data_segs_in:2
segs_out:6
data_segs_out:0 lost_out:0 sacked_out:0 bytes_received:23
bytes_acked:2
      check_result:PASS:listen_sk 0 nsec
      check_result:PASS:srv_sk 0 nsec
      check_result:PASS:srv_tp 0 nsec

If bpf program is compiled with llvm16, the test passed
according
to
a CI run.

I don't have s390x environment to debug this. Could you
help
debug
it?

Thanks!

      [1]
https://github.com/kernel-patches/vmtest/actions/runs/4866851496/jobs/8679080985?pr=224#step:6:7645


Hi,

thank for letting me know.
I will look into this.

Best regards,
Ilya

In the meantime the issue was fixed by:

commit 141be5c062ecf22bd287afffd310e8ac4711444a
Author: Shoaib Meenai <smeenai@xxxxxx>
Date:   Fri May 5 14:18:12 2023 -0700

       Revert "Reland [Pipeline] Don't limit ArgumentPromotion
to -
O3"
      This reverts commit
6f29d1adf29820daae9ea7a01ae2588b67735b9e.
      https://reviews.llvm.org/D149768    is causing size
regressions
for -
Oz
       with FullLTO, and I'm reverting that one while
investigating.
This
       commit depends on that one, so it needs to be reverted as
well.

The transformtion "Don't limit ArgumentPromotion to -O3" is
temporarily
reverted. But it could be reverted again once the issue is
resolved.
So it is a good idea to resolve the issue in the kernel.


But looking at the codegen differences:

$ diff -u <(sed -e s/[0-9]*://g pass.s) <(sed -e s/[0-9]*://g
fail.s)

-pass.o:        file format elf64-bpf
+fail.o:        file format elf64-bpf

-00000000000002c8 <sk_dst_port__load_half>
-       69 11 00 30 00 00 00 00 r1 = *(u16 *)(r1 + 48)
+00000000000002c0 <sk_dst_port__load_half>
+       54 10 00 00 00 00 ff ff w1 &= 65535
           b4 00 00 00 00 00 00 01 w0 = 1
           16 10 00 01 00 00 ca fe if w1 == 51966 goto +1
<LBB6_2>
           b4 00 00 00 00 00 00 00 w0 = 0

This is what ArgumentPromotion is supposed to do, so that's
okay so
far. However, further down below we have:

    Disassembly of section cgroup_skb/egress:

-       bf 16 00 00 00 00 00 00 r1 = r6
+       61 76 00 30 00 00 00 00 r7 = *(u32 *)(r6 + 48)
+       bc 17 00 00 00 00 00 00 w1 = w7
           85 01 00 00 00 00 00 53 call sk_dst_port__load_word

...

-       bf 16 00 00 00 00 00 00 r1 = r6
+       74 70 00 00 00 00 00 10 w7 >>= 16
+       bc 17 00 00 00 00 00 00 w1 = w7
           85 01 00 00 00 00 00 57 call sk_dst_port__load_half

so there is no 16-bit load anymore, instead, the result from
the
earlier 32-bit load is reused. However, on BE these kinds of
loads
for this particular field are not consistent at the moment -
see
[1]
and the previous discussions.

De-facto we have the following results:

- int load: 0x0000cafe
- short load: 0xcafe

So 'De-facto' means the above is the expected result.


On a consistent BE we should have rather had:

- int load: 0x0000cafe
- short load: 0

What does 'consistent BE' mean here? Does it mean the expected
result from the source code itself?

I should not have called the de-facto example "BE" at all: it's
rather
"mixed endianness" or "weird endianness" or something along these
lines.

On "consistent BE" or simply "BE" properties like

*(uint32_t *)p = (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2);

hold. This is currently not the case for bpf_sock.dst_port.

We compile with -mbig-endian, so we promise to the compiler that
the
machine is big-endian, and the compiler expects the above to hold
for
any p. Unfortunately when p points to bpf_sock.dst_port, this is
not
the case.

If I understand correctly, *(uint32_t *)p to get the
bpf_sock.dst_port
is for backward compatibility. But the real u32 read by compiler will
do (*(uint16_t *)p << 16) | *(uint16_t *)(p + 2) which is not the
same as expected *(uint32_t *)p so we have problem here.


The property above is important for the correctness of the
load/store
tearing transformations. What we have here is not exactly tearing,
but
is quite close.

Clang, of course, expects a consistent BE and optimizes around
that.

This was a conscious tradeoff Jakub and I have agreed on in
order
to
keep the quirky behavior from the past. Given what's happening
with
Clang now, I wonder if it would be worth revisiting it in the
name
of
consistency?

If I understand correctly, I think the issue is
       r7 = *(u32 *)(r6 + 48)
       w7 >= 16
       w1 = w7

after verifier, it is changed to
      r7 = *(u16 *)(r6 + <kernel offset>)
      w7 >= 16
      w1 = w7

and the result after verifier rewrite is completely wrong.
Is it right?

No, the verifier rewrite is correct.
The sk_dst_port__load_word() part of the test succeeds.

All these rewrites already work fine, they are correct and
consistent.
It's really just bpf_sock.dst_port that is special.

If this is the case, during verifier rewrite, if it is
big endian, say the user intends to load 4 bytes (from uapi
header)
while the kernel field is 2 bytes, in such cases, kernel
can still pretend to generate 4-byte load. For example,
for the above example, the code after verification could be:
      r7 = *(u16 *)(r6 + <kernel offset>)
      r7 <= 16
      w7 >= 16
      w1 = w7

Hopefully, we won't have many such cases. Does this work?

This would break the sk_dst_port__load_word() part of the test.

This is a hack. This may work for this specific u16 case, but
yes, it won't work for u32 load case.




Above I asked whether we can resolve the inconsistency, but I
thought
about it and I don't see a way of doing it without breaking the
ABI,
which is at worst unacceptable, and at best a last resort measure.

What do you think about marking bpf_sock.dst_port volatile?
volatile
should prevent tearing and similar optimizations, with which we
have a
problem here.

We could also add a comment warning users not to cast away this
volatile due to the quirk we have. And then we should adjust the
test
(making all casts volatile) to comply with this new warning.

I did a little study on this. The main problem here for
static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
{
          __u16 *half = (__u16 *)&sk->dst_port;
          return half[0] == bpf_htons(0xcafe);
}

Through some cross-function optimization by ArgumentPromotion
optimization, the compiler does:
     /* the below shared by sk_dst_port__load_word
      * and sk_dst_port__load_half
      */
     __u32 *word = (__u32 *)&sk->dst_port;
     __u32 word_val = word[0];

     /* the below is for sk_dst_port__load_half only */
     __u16 half_val = word_val >> 16;

     ... half_val passed into sk_dst_port__load_half ...
     return half_val == bpf_htons(0xcafe);

Here, 'word_val = word[0]' is replaced by 2-byte load
by the verifier and this caused the trouble for later
sk_dst_port__load_half().

I don't have a good solution here. The issue is exposed
as we have both u16 and u32 load for &sk->dst_port and
the compiler did some optimization with this.

I would say this is an extreme corner case and we can just
fix in the source code like below:

diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
b/tools/testing/selftests/bpf/progs/test_sock_fields.c
index bbad3c2d9aa5..39c975786720 100644
--- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
+++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
@@ -265,7 +265,10 @@ static __noinline bool
sk_dst_port__load_word(struct bpf_sock *sk)

   static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
   {
-       __u16 *half = (__u16 *)&sk->dst_port;
+       __u16 *half;
+
+       asm volatile ("");
+       half  = (__u16 *)&sk->dst_port;
          return half[0] == bpf_htons(0xcafe);
   }

Could you try whether the above workaround works or not?
If we want the code to be future proof for potential
cross-func optimization for these noinline functions, we
can add similar asm codes to all of
bool sk_dst_port__load_{word, half, byte}.

Hi,

this makes the issue go away, thanks.

However, I'm still concerned, because this only inhibits a certain
optimization and does not address the underlying fundamental problem:
we promise to clang that the in-kernel implementation of the eBPF
virtual machine is big-endian, while in reality it's not. As compiler
optimizations get more aggressive, we will surely see more of this.

Why not do this instead?

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1bb11a6ee667..3c9b535532ae 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6102,7 +6102,7 @@ struct bpf_sock {
  	__u32 src_ip4;
  	__u32 src_ip6[4];
  	__u32 src_port;		/* host byte order */
-	__be16 dst_port;	/* network byte order */
+	volatile __be16 dst_port;	/* network byte order */
  	__u16 :16;		/* zero padding */
  	__u32 dst_ip4;
  	__u32 dst_ip6[4];
diff --git a/tools/include/uapi/linux/bpf.h
b/tools/include/uapi/linux/bpf.h
index 1bb11a6ee667..3c9b535532ae 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6102,7 +6102,7 @@ struct bpf_sock {
  	__u32 src_ip4;
  	__u32 src_ip6[4];
  	__u32 src_port;		/* host byte order */
-	__be16 dst_port;	/* network byte order */
+	volatile __be16 dst_port;	/* network byte order */
  	__u16 :16;		/* zero padding */
  	__u32 dst_ip4;
  	__u32 dst_ip6[4];
diff --git a/tools/testing/selftests/bpf/progs/test_sock_fields.c
b/tools/testing/selftests/bpf/progs/test_sock_fields.c
index bbad3c2d9aa5..773ded84ac12 100644
--- a/tools/testing/selftests/bpf/progs/test_sock_fields.c
+++ b/tools/testing/selftests/bpf/progs/test_sock_fields.c
@@ -259,19 +259,19 @@ int ingress_read_sock_fields(struct __sk_buff
*skb)
   */
  static __noinline bool sk_dst_port__load_word(struct bpf_sock *sk)
  {
-	__u32 *word = (__u32 *)&sk->dst_port;
+	volatile __u32 *word = (volatile __u32 *)&sk->dst_port;
  	return word[0] == bpf_htons(0xcafe);
  }
static __noinline bool sk_dst_port__load_half(struct bpf_sock *sk)
  {
-	__u16 *half = (__u16 *)&sk->dst_port;
+	volatile __u16 *half = (volatile __u16 *)&sk->dst_port;
  	return half[0] == bpf_htons(0xcafe);
  }
static __noinline bool sk_dst_port__load_byte(struct bpf_sock *sk)
  {
-	__u8 *byte = (__u8 *)&sk->dst_port;
+	volatile __u8 *byte = (volatile __u8 *)&sk->dst_port;
  	return byte[0] == 0xca && byte[1] == 0xfe;
  }
This also works, and as far as I'm concerned, this would be a proper
fix for the underlying issue: we tell the compiler that it should never
ever (with any of today's or future optimizations) try to be clever
when accessing dst_port.

The above test_sock_fields.c change should work too.
I think the uapi change is not necessary. The key word 'volatile'
intends to avoid merging two or more identical loads together.
In this particular case, with only uapi changes, the
harmful transformation can still happen since the sk->dst_port
is indeed loaded only once.

The test_sock_fields.c change itself forces proper load
which is verifier friendly. So I suggest to have
test_sock_fields.c change only. My previously suggested
changes have the same effect, to preserve the verifier friendly
load.


Best regards,
Ilya


[1]
https://lore.kernel.org/bpf/20220317113920.1068535-5-jakub@xxxxxxxxxxxxxx





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux