On Fri, 2023-01-13 at 09:53 +0100, Jose E. Marchesi wrote:
> > On 1/12/23 2:27 PM, Alexei Starovoitov wrote:
> > > On Thu, Jan 05, 2023 at 02:07:05PM +0200, Eduard Zingerman wrote:
> > > > On Thu, 2023-01-05 at 11:06 +0100, Jose E. Marchesi wrote:
> > > > > > On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > BPF has two documented (non-atomic) memory store instructions:
> > > > > > >
> > > > > > > BPF_STX: *(size *) (dst_reg + off) = src_reg
> > > > > > > BPF_ST : *(size *) (dst_reg + off) = imm32
> > > > > > >
> > > > > > > Currently the LLVM BPF back-end does not emit the BPF_ST
> > > > > > > instruction and does not allow one to be specified as inline
> > > > > > > assembly.
> > > > > > >
> > > > > > > Recently I've been exploring ways to port some of the verifier
> > > > > > > test cases from tools/testing/selftests/bpf/verifier/*.c to use
> > > > > > > inline assembly and the machinery provided in
> > > > > > > tools/testing/selftests/bpf/test_loader.c (which should
> > > > > > > hopefully simplify test maintenance).
> > > > > > > The BPF_ST instruction is popular in these tests: it is used in
> > > > > > > 52 of 94 files.
> > > > > > >
> > > > > > > While it is possible to adjust LLVM to only support BPF_ST for
> > > > > > > inline assembly blocks, that seems a bit wasteful. This
> > > > > > > patch-set contains the changes to the verifier that are
> > > > > > > necessary in case LLVM is allowed to freely emit BPF_ST
> > > > > > > instructions (source code is available here [1]).
> > > > > >
> > > > > > Would we gate LLVM's emitting of BPF_ST for C code behind some new
> > > > > > cpu=v4? What is the benefit for the compiler to start
> > > > > > automatically emitting such instructions? So, thinking about
> > > > > > logistics: if there isn't much benefit, as a BPF application
> > > > > > owner I wouldn't bother enabling this behavior, risking
> > > > > > regressions on old kernels that don't have these changes.
> > > > >
> > > > > Hmm, GCC happily generates BPF_ST instructions:
> > > > >
> > > > > $ echo 'int v; void foo () { v = 666; }' | bpf-unknown-none-gcc -O2 -xc -S -o foo.s -
> > > > > $ cat foo.s
> > > > >         .file   "<stdin>"
> > > > >         .text
> > > > >         .align  3
> > > > >         .global foo
> > > > >         .type   foo, @function
> > > > > foo:
> > > > >         lddw    %r0,v
> > > > >         stw     [%r0+0],666
> > > > >         exit
> > > > >         .size   foo, .-foo
> > > > >         .global v
> > > > >         .type   v, @object
> > > > >         .lcomm  v,4,4
> > > > >         .ident  "GCC: (GNU) 12.0.0 20211206 (experimental)"
> > > > >
> > > > > It has been doing that since October 2019, I think before the cpu
> > > > > versioning mechanism was put in place?
> > > > >
> > > > > We weren't aware this was problematic. Does the verifier reject
> > > > > such instructions?
> > > >
> > > > Interesting, do the BPF selftests generated by GCC pass the same way
> > > > they do when generated by clang?
> > > >
> > > > I had to make the following changes to the verifier to get the
> > > > selftests to pass when the BPF_ST instruction is allowed for
> > > > selection:
> > > >
> > > > - patch #1 in this patchset: track values of constants written to
> > > >   the stack using BPF_ST. Currently these are tracked imprecisely,
> > > >   unlike writes using BPF_STX, e.g.:
> > > >
> > > >   fp[-8] = 42;   currently the verifier assumes that fp[-8]=mmmmmmmm
> > > >                  after such an instruction, where m stands for
> > > >                  "misc", just a note that something is written at
> > > >                  fp[-8].
> > > >
> > > >   r1 = 42;       the verifier tracks r1=42 after this instruction.
> > > >   fp[-8] = r1;   the verifier tracks fp[-8]=42 after this instruction.
> > > >
> > > >   So the patch makes both cases equivalent.
> > > >
> > > > - patch #3 in this patchset: adjusts verifier.c:convert_ctx_access()
> > > >   to operate on BPF_ST alongside BPF_STX.
> > > >
> > > >   Context parameters for some BPF program types are "fake" data
> > > >   structures.
> > > >   The verifier matches all BPF_STX and BPF_LDX instructions that
> > > >   operate on pointers to such contexts and rewrites these
> > > >   instructions. It might change an offset, add another layer of
> > > >   indirection, etc. E.g. see filter.c:bpf_convert_ctx_access().
> > > >   (This also implies that the verifier forbids writes to
> > > >   non-constant offsets inside such structures.)
> > > >
> > > >   So the patch extends this logic to also handle BPF_ST.
> > >
> > > Patch 3 is necessary to land before llvm starts generating 'st'
> > > for ctx access. That's clear, but I'm missing why patch 1 is
> > > necessary. Sure, it makes the verifier understand scalar spills
> > > with 'st' and makes 'st' equivalent to 'stx', but I'm missing why
> > > it's necessary.
> > > What kind of programs fail to be verified when llvm starts generating 'st'?

I should have added an example to the summary. There are a few
test_prog tests that fail w/o this patch, namely: atomic_bounds,
dynptr, for_each, xdp_noinline, xdp_synproxy.

Here is how atomic_bounds looks:

SEC("fentry/bpf_fentry_test1")
int BPF_PROG(sub, int x)
{
        int a = 0;
        int b = __sync_fetch_and_add(&a, 1);
        /* b is certainly 0 here. Can the verifier tell? */
        while (b)
                continue;
        return 0;
}

Compiled w/o BPF_ST:

<sub>:
    w1 = 0x0
    *(u32 *)(r10 - 0x4) = r1
    w1 = 0x1
    lock *(u32 *)(r10 - 0x4) += r1
    if w1 == 0x0 goto +0x1 <LBB0_2>
<LBB0_1>:
    goto -0x1 <LBB0_1>
<LBB0_2>:
    w0 = 0x0
    exit

Compiled with BPF_ST:

<sub>:
    *(u32 *)(r10 - 0x4) = 0x0
    w1 = 0x1
    lock *(u32 *)(r10 - 0x4) += r1
    if w1 == 0x0 goto +0x1 <LBB0_2>
<LBB0_1>:
    goto -0x1 <LBB0_1>
<LBB0_2>:
    w0 = 0x0
    exit

When compiled with BPF_ST and verified w/o patch #1 the verification
log looks as follows:

0: R1=ctx(off=0,imm=0) R10=fp0
0: (62) *(u32 *)(r10 -4) = 0          ; R10=fp0 fp-8=mmmm????
1: (b4) w1 = 1                        ; R1_w=1
2: (c3) r1 = atomic_fetch_add((u32 *)(r10 -4), r1)        ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=mmmm????
3: (16) if w1 == 0x0 goto pc+1        ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
4: (05) goto pc-1
4: (05) goto pc-1
4: (05) goto pc-1
4: (05) goto pc-1
infinite loop detected at insn 4

When compiled w/o BPF_ST and verified w/o patch #1 the verification
log looks as follows:

func#0 @0
reg type unsupported for arg#0 function sub#5
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b4) w1 = 0                        ; R1_w=0
1: (63) *(u32 *)(r10 -4) = r1
last_idx 1 first_idx 0
regs=2 stack=0 before 0: (b4) w1 = 0
2: R1_w=0 R10=fp0 fp-8=0000????
2: (b4) w1 = 1                        ; R1_w=1
; int b = __sync_fetch_and_add(&a, 1);
3: (c3) r1 = atomic_fetch_add((u32 *)(r10 -4), r1)        ; R1_w=P0 R10=fp0 fp-8=mmmm????
4: (16) if w1 == 0x0 goto pc+1
6: R1_w=P0
6: (b4) w0 = 0                        ; R0_w=0
7: (95) exit
processed 7 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

The difference comes from the way the zero write to `r10-4` is
processed: with BPF_ST it is tracked as `fp-8=mmmm????` after the
write, without BPF_ST it is tracked as `fp-8=0000????` after the
write. This is caused by the way `check_stack_write_fixed_off()` is
written: for register spills it saves either the complete register
state, or STACK_ZERO if the register is known to be zero. However,
for BPF_ST it just saves STACK_MISC. Hence, patch #1.

> > > Regarding -mcpu=v4.
> > > I think we need to add all of our upcoming instructions behind a
> > > single flag. Otherwise we'll have -mcpu=v5,v6,v7 and full
> > > combinations of them.
> > > -mcpu=v4 could mean:
> > > - ST
> > > - sign extending loads
> > > - sign extend a register
> > > - 32-bit JA
> > > - proper bswap insns: bswap16, bswap32, bswap64
> > > The sign and 32-bit JA we've discussed earlier.
> > > The bswap has been on my wish list forever.
> > > The existing TO_LE, TO_BE insns are really odd from a compiler pov.
> > > The compiler should translate the bswap IR op into a proper bswap
> > > insn just like it does on all cpus.
> > > Maybe add SDIV to -mcpu=v4 as well?
There is also BPF_JSET: "PC += off if dst & src", which is not
currently emitted by the LLVM backend as far as I can tell.

> > Right, we should add these insns in llvm17 with -mcpu=v4, so we
> > can keep the number of cpu generations to a minimum.
> 
> How do you plan to encode the sign-extend load instructions?
> 
> I guess a possibility would be to use one of the op-modes for load
> instructions that are currently marked as reserved. For example:
> 
>   IMM = 0b000
>   ABS = 0b001
>   IND = 0b010
>   MEM = 0b011
>   SEM = 0b100  <- new
> 
> Then we would have the following code fields for sign-extending LDX
> instructions:
> 
>   op-mode:SEM op-size:{W,H,B,DW} op-class:LDX

I second Jose's question about the encoding for the new instructions.

> - sign extend a register

E.g. like this:

  BPF_SEXT = 0xb0
  opcode: BPF_SEXT | BPF_X | BPF_ALU
  imm32:  {8,16,32}   // to be consistent with the BPF_END insn
      or  BPF_{B,H,W} // to be consistent with LDX/STX

Sign-extend an 8, 16 or 32-bit value from src to a 64-bit dst register:

  dst = sext{8,16,32}(src)

> The sign and 32-bit JA we've discussed earlier.

Could you please share a link to this discussion?

> - proper bswap insns: bswap16, bswap32, bswap64

Could you please elaborate on this?
The current instructions operate on a single register, e.g.:
dst_reg = htole16(dst_reg). It looks like this could be reflected in
x86 assembly as a single instruction, using either the bswap or the
xchg instruction (see [1] and [2]). x86/net/bpf_jit_comp.c:1266 can
probably be adjusted to use xchg for the 16-bit case. For ARM, rev16,
rev32 and rev64 allow using the same register for source and
destination ([3]).
[1] https://www.felixcloutier.com/x86/bswap
[2] https://www.felixcloutier.com/x86/xchg
[3] https://developer.arm.com/documentation/ddi0602/2022-06/Base-Instructions/REV16--Reverse-bytes-in-16-bit-halfwords-?lang=en
    https://developer.arm.com/documentation/ddi0602/2022-06/Base-Instructions/REV32--Reverse-bytes-in-32-bit-words-?lang=en
    https://developer.arm.com/documentation/ddi0602/2022-06/Base-Instructions/REV64--Reverse-Bytes--an-alias-of-REV-?lang=en