> On May 17, 2022, at 2:10 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>
> On Wed, May 11, 2022 at 07:39:34PM -0700, Alexei Starovoitov wrote:
>> Turns out that clang's -mcpu=v1,v2,v3 are not exactly correct.
>> We've extended the ISA more than three times.
>> For example when we added more atomic insns in
>> https://lore.kernel.org/bpf/20210114181751.768687-1-jackmanb@xxxxxxxxxx/
>>
>> The corresponding llvm diff didn't add a new -mcpu flavor.
>> There was no need to do it.
>
> ... because all uses of these new instructions are through builtins
> that wouldn't otherwise be available, yes.
>
>> Also llvm flags can turn a subset of insns on and off.
>> Like llvm can turn on alu32, but without the <,<= insns.
>> -mcpu=v3 includes both. -mcpu=v2 is only <,<=.
>>
>> So we need a plan B.
>
> Yes.
>
>> How about using the commit sha where support was added to the verifier
>> as a 'version' of the ISA?
>>
>> We can try to use a kernel version, but backports
>> will ruin that picture.
>> Looks like an upstream 'commit sha' is the only stable number.
>
> Using kernel release hashes is a pretty horrible interface, especially
> for non-kernel users. I also think the compilers and other tools
> would really like some vaguely meaningful identifiers.
>
>> Another approach would be to declare the current ISA as
>> 1.0 (or bpf-isa-may-2022) and
>> in the future bump it with every new insn.
>
> I think that is a much more reasonable starting position. However, are
> we sure the ISA actually evolves linearly? As far as I can tell,
> support for the full atomics only exists in a few JITs so far.
>
> So maybe starting with a baseline, and then having a name for
> each meaningful extension (e.g. the full atomics as a start) might be
> even better. For the Linux kernel case we can then also have a user
> interface where userspace programs can query which extensions are
> supported before loading eBPF programs that rely on them, instead of
> doing a roundtrip through the verifier.
>
>>> - we need to decide what to do about the legacy BPF packet access
>>>   instructions. Alexei mentioned that the modern JIT doesn't
>>>   even use those internally any more.
>>
>> I think we need to document them as supported in the Linux kernel,
>> but deprecated in general.
>> The standard might say "implementation defined", meaning that
>> different run-times don't have to support them.
>
> Yeah. If we do the extensions proposal above we could make these
> a specific extension as well.
>
>> [...]
>> I don't think it's worth documenting all that.
>> I would group all undefined/underflow/overflow behavior as
>> implementation defined and document only the things that matter.
>
> Makes sense.
>
>>> Discussion on where to host a definitive version of the document:
>>>
>>> - I think the rough consensus is to just host regular (hopefully
>>>   low cadence) documents and maybe the latest greatest at an eBPF
>>>   Foundation website. Whom do we need to work with at the foundation
>>>   to make this happen?
>>
>> foundation folks cc-ed.
>
> I'd be really glad if we could kick this off ASAP. Feel free to contact
> me privately if we want to keep it off the list.
>
>>> - as an idea it was brought up to write a document with the minimal
>>>   verification requirements required for any eBPF implementation
>>>   independent of the program type. Again I can volunteer to
>>>   draft such a document, but I need input on what the consensus
>>>   would be. In this case input from the non-Linux verifier
>>>   implementors (I only know the Microsoft research one) would
>>>   be very helpful as well.
>>
>> The verifier is a moving target.
>
> Absolutely.
>
>> I'd say minimal verification is the one that checks that:
>> - instructions are formed correctly
>> - opcode is valid
>> - no reserved bits are used
>> - registers are within range (r11+ are not used)
>> - combination of opcode+regs+off+imm is valid
>> - simple things like that
>
> Sounds good. One useful thing for this would be an opcode table
> with all the optional field usage in machine readable format.
>
> Jim who is on CC has already built a nice table of all opcodes based
> on existing material that might be a good starting point.
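To make Alexei's checklist concrete, here is a rough sketch of those
structural checks. This is hypothetical code, not the kernel verifier:
the struct layout is the one from include/uapi/linux/bpf.h, and a real
opcode_valid() would be generated from a machine readable form of the
tables that follow, so treat the details as needing review.

	/* Hypothetical sketch of minimal structural checks; not
	 * verifier code. Instruction layout per include/uapi/linux/bpf.h. */
	#include <stdbool.h>
	#include <stdint.h>

	struct bpf_insn {
		uint8_t  code;		/* opcode */
		uint8_t  dst_reg:4;	/* destination register */
		uint8_t  src_reg:4;	/* source register */
		int16_t  off;		/* signed offset */
		int32_t  imm;		/* signed immediate constant */
	};

	#define MAX_BPF_REG	10	/* r0..r10; r11+ do not exist */

	static bool opcode_valid(uint8_t code)
	{
		/* placeholder: a real check would index a 256-entry
		 * table generated from the opcode list below. 0x00 is
		 * reserved; it only appears as the second slot of
		 * bpf_lddw, which needs special casing. */
		return code != 0x00;
	}

	static bool insn_valid(const struct bpf_insn *insn)
	{
		if (!opcode_valid(insn->code))
			return false;
		if (insn->dst_reg > MAX_BPF_REG ||
		    insn->src_reg > MAX_BPF_REG)
			return false;
		/* per-opcode reserved fields must be zero, e.g.
		 * bpf_exit (0x95) takes no regs, offset or immediate */
		if (insn->code == 0x95 &&
		    (insn->dst_reg || insn->src_reg || insn->off || insn->imm))
			return false;
		return true;
	}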
Table is inline below. I used the tables in the iovisor project
documentation as a starting point:
https://github.com/iovisor/bpf-docs/blob/master/eBPF.md. Feedback
welcome. The atomic sections at the bottom could especially use some
careful review for correctness.
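For reference, the 'opc & 0x07' class tests in the section headings
below, and the size/mode and source/operation splits within each class,
follow the standard opcode masks; these are the definitions from the
kernel's include/uapi/linux/bpf_common.h and linux/bpf.h (values as I
read them, so worth double checking):

	#define BPF_CLASS(code) ((code) & 0x07)	/* LD, LDX, ST, STX, ALU, JMP, JMP32, ALU64 */

	/* load/store classes split the remaining bits as size + mode */
	#define BPF_SIZE(code)	((code) & 0x18)	/* W=0x00, H=0x08, B=0x10, DW=0x18 */
	#define BPF_MODE(code)	((code) & 0xe0)	/* IMM=0x00, MEM=0x60, ATOMIC=0xc0, ... */

	/* ALU/JMP classes split them as source + operation */
	#define BPF_SRC(code)	((code) & 0x08)	/* K=0x00 (use imm), X=0x08 (use src reg) */
	#define BPF_OP(code)	((code) & 0xf0)	/* ADD=0x00, SUB=0x10, ..., END=0xd0 */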
BPF_ALU64 (opc & 0x07 == 0x07)
==============================
0x07  BPF_ALU64 | BPF_K | BPF_ADD   bpf_add dst, imm        dst += imm
0x0f  BPF_ALU64 | BPF_X | BPF_ADD   bpf_add dst, src        dst += src
0x17  BPF_ALU64 | BPF_K | BPF_SUB   bpf_sub dst, imm        dst -= imm
0x1f  BPF_ALU64 | BPF_X | BPF_SUB   bpf_sub dst, src        dst -= src
0x27  BPF_ALU64 | BPF_K | BPF_MUL   bpf_mul dst, imm        dst *= imm
0x2f  BPF_ALU64 | BPF_X | BPF_MUL   bpf_mul dst, src        dst *= src
0x37  BPF_ALU64 | BPF_K | BPF_DIV   bpf_div dst, imm        dst /= imm
0x3f  BPF_ALU64 | BPF_X | BPF_DIV   bpf_div dst, src        dst /= src
0x47  BPF_ALU64 | BPF_K | BPF_OR    bpf_or dst, imm         dst |= imm
0x4f  BPF_ALU64 | BPF_X | BPF_OR    bpf_or dst, src         dst |= src
0x57  BPF_ALU64 | BPF_K | BPF_AND   bpf_and dst, imm        dst &= imm
0x5f  BPF_ALU64 | BPF_X | BPF_AND   bpf_and dst, src        dst &= src
0x67  BPF_ALU64 | BPF_K | BPF_LSH   bpf_lsh dst, imm        dst <<= imm
0x6f  BPF_ALU64 | BPF_X | BPF_LSH   bpf_lsh dst, src        dst <<= src
0x77  BPF_ALU64 | BPF_K | BPF_RSH   bpf_rsh dst, imm        dst >>= imm (logical)
0x7f  BPF_ALU64 | BPF_X | BPF_RSH   bpf_rsh dst, src        dst >>= src (logical)
0x87  BPF_ALU64 | BPF_K | BPF_NEG   bpf_neg dst             dst = -dst
0x97  BPF_ALU64 | BPF_K | BPF_MOD   bpf_mod dst, imm        dst %= imm
0x9f  BPF_ALU64 | BPF_X | BPF_MOD   bpf_mod dst, src        dst %= src
0xa7  BPF_ALU64 | BPF_K | BPF_XOR   bpf_xor dst, imm        dst ^= imm
0xaf  BPF_ALU64 | BPF_X | BPF_XOR   bpf_xor dst, src        dst ^= src
0xb7  BPF_ALU64 | BPF_K | BPF_MOV   bpf_mov dst, imm        dst = imm
0xbf  BPF_ALU64 | BPF_X | BPF_MOV   bpf_mov dst, src        dst = src
0xc7  BPF_ALU64 | BPF_K | BPF_ARSH  bpf_arsh dst, imm       dst >>= imm (arithmetic)
0xcf  BPF_ALU64 | BPF_X | BPF_ARSH  bpf_arsh dst, src       dst >>= src (arithmetic)

BPF_ALU (opc & 0x07 == 0x04)
============================
0x04  BPF_ALU | BPF_K | BPF_ADD     bpf_add32 dst, imm      dst32 += imm
0x0c  BPF_ALU | BPF_X | BPF_ADD     bpf_add32 dst, src      dst32 += src32
0x14  BPF_ALU | BPF_K | BPF_SUB     bpf_sub32 dst, imm      dst32 -= imm
0x1c  BPF_ALU | BPF_X | BPF_SUB     bpf_sub32 dst, src      dst32 -= src32
0x24  BPF_ALU | BPF_K | BPF_MUL     bpf_mul32 dst, imm      dst32 *= imm
0x2c  BPF_ALU | BPF_X | BPF_MUL     bpf_mul32 dst, src      dst32 *= src32
0x34  BPF_ALU | BPF_K | BPF_DIV     bpf_div32 dst, imm      dst32 /= imm
0x3c  BPF_ALU | BPF_X | BPF_DIV     bpf_div32 dst, src      dst32 /= src32
0x44  BPF_ALU | BPF_K | BPF_OR      bpf_or32 dst, imm       dst32 |= imm
0x4c  BPF_ALU | BPF_X | BPF_OR      bpf_or32 dst, src       dst32 |= src32
0x54  BPF_ALU | BPF_K | BPF_AND     bpf_and32 dst, imm      dst32 &= imm
0x5c  BPF_ALU | BPF_X | BPF_AND     bpf_and32 dst, src      dst32 &= src32
0x64  BPF_ALU | BPF_K | BPF_LSH     bpf_lsh32 dst, imm      dst32 <<= imm
0x6c  BPF_ALU | BPF_X | BPF_LSH     bpf_lsh32 dst, src      dst32 <<= src32
0x74  BPF_ALU | BPF_K | BPF_RSH     bpf_rsh32 dst, imm      dst32 >>= imm (logical)
0x7c  BPF_ALU | BPF_X | BPF_RSH     bpf_rsh32 dst, src      dst32 >>= src32 (logical)
0x84  BPF_ALU | BPF_K | BPF_NEG     bpf_neg32 dst           dst32 = -dst32
0x94  BPF_ALU | BPF_K | BPF_MOD     bpf_mod32 dst, imm      dst32 %= imm
0x9c  BPF_ALU | BPF_X | BPF_MOD     bpf_mod32 dst, src      dst32 %= src32
0xa4  BPF_ALU | BPF_K | BPF_XOR     bpf_xor32 dst, imm      dst32 ^= imm
0xac  BPF_ALU | BPF_X | BPF_XOR     bpf_xor32 dst, src      dst32 ^= src32
0xb4  BPF_ALU | BPF_K | BPF_MOV     bpf_mov32 dst, imm      dst32 = imm
0xbc  BPF_ALU | BPF_X | BPF_MOV     bpf_mov32 dst, src      dst32 = src32
0xc4  BPF_ALU | BPF_K | BPF_ARSH    bpf_arsh32 dst, imm     dst32 >>= imm (arithmetic)
0xcc  BPF_ALU | BPF_X | BPF_ARSH    bpf_arsh32 dst, src     dst32 >>= src32 (arithmetic)

For BPF_END instructions, BPF_K selects a htole conversion and BPF_X a
htobe conversion. The operation size (16, 32 or 64 bit) is determined
by the 'imm' field of the instruction (the upper 32 bits of the
8-byte instruction).

0xd4 (.imm = 16)  BPF_ALU | BPF_K | BPF_END   bpf_le16 dst   dst16 = htole16(dst16)
0xd4 (.imm = 32)  BPF_ALU | BPF_K | BPF_END   bpf_le32 dst   dst32 = htole32(dst32)
0xd4 (.imm = 64)  BPF_ALU | BPF_K | BPF_END   bpf_le64 dst   dst = htole64(dst)
0xdc (.imm = 16)  BPF_ALU | BPF_X | BPF_END   bpf_be16 dst   dst16 = htobe16(dst16)
0xdc (.imm = 32)  BPF_ALU | BPF_X | BPF_END   bpf_be32 dst   dst32 = htobe32(dst32)
0xdc (.imm = 64)  BPF_ALU | BPF_X | BPF_END   bpf_be64 dst   dst = htobe64(dst)

BPF_JMP (opc & 0x07 == 0x05)
============================
0x05  BPF_JMP | BPF_K | BPF_JA      bpf_ja +off             PC += off
0x15  BPF_JMP | BPF_K | BPF_JEQ     bpf_jeq dst, imm, +off  PC += off if dst == imm
0x1d  BPF_JMP | BPF_X | BPF_JEQ     bpf_jeq dst, src, +off  PC += off if dst == src
0x25  BPF_JMP | BPF_K | BPF_JGT     bpf_jgt dst, imm, +off  PC += off if dst > imm
0x2d  BPF_JMP | BPF_X | BPF_JGT     bpf_jgt dst, src, +off  PC += off if dst > src
0x35  BPF_JMP | BPF_K | BPF_JGE     bpf_jge dst, imm, +off  PC += off if dst >= imm
0x3d  BPF_JMP | BPF_X | BPF_JGE     bpf_jge dst, src, +off  PC += off if dst >= src
0x45  BPF_JMP | BPF_K | BPF_JSET    bpf_jset dst, imm, +off PC += off if dst & imm
0x4d  BPF_JMP | BPF_X | BPF_JSET    bpf_jset dst, src, +off PC += off if dst & src
0x55  BPF_JMP | BPF_K | BPF_JNE     bpf_jne dst, imm, +off  PC += off if dst != imm
0x5d  BPF_JMP | BPF_X | BPF_JNE     bpf_jne dst, src, +off  PC += off if dst != src
0x65  BPF_JMP | BPF_K | BPF_JSGT    bpf_jsgt dst, imm, +off PC += off if dst > imm (signed)
0x6d  BPF_JMP | BPF_X | BPF_JSGT    bpf_jsgt dst, src, +off PC += off if dst > src (signed)
0x75  BPF_JMP | BPF_K | BPF_JSGE    bpf_jsge dst, imm, +off PC += off if dst >= imm (signed)
0x7d  BPF_JMP | BPF_X | BPF_JSGE    bpf_jsge dst, src, +off PC += off if dst >= src (signed)
0x85  BPF_JMP | BPF_K | BPF_CALL    bpf_call imm            Function call
0x95  BPF_JMP | BPF_K | BPF_EXIT    bpf_exit                return r0
0xa5  BPF_JMP | BPF_K | BPF_JLT     bpf_jlt dst, imm, +off  PC += off if dst < imm
0xad  BPF_JMP | BPF_X | BPF_JLT     bpf_jlt dst, src, +off  PC += off if dst < src
0xb5  BPF_JMP | BPF_K | BPF_JLE     bpf_jle dst, imm, +off  PC += off if dst <= imm
0xbd  BPF_JMP | BPF_X | BPF_JLE     bpf_jle dst, src, +off  PC += off if dst <= src
0xc5  BPF_JMP | BPF_K | BPF_JSLT    bpf_jslt dst, imm, +off PC += off if dst < imm (signed)
0xcd  BPF_JMP | BPF_X | BPF_JSLT    bpf_jslt dst, src, +off PC += off if dst < src (signed)
0xd5  BPF_JMP | BPF_K | BPF_JSLE    bpf_jsle dst, imm, +off PC += off if dst <= imm (signed)
0xdd  BPF_JMP | BPF_X | BPF_JSLE    bpf_jsle dst, src, +off PC += off if dst <= src (signed)

BPF_JMP32 (opc & 0x07 == 0x06)
==============================
0x16  BPF_JMP32 | BPF_K | BPF_JEQ   bpf_jeq32 dst, imm, +off    PC += off if dst32 == imm
0x1e  BPF_JMP32 | BPF_X | BPF_JEQ   bpf_jeq32 dst, src, +off    PC += off if dst32 == src32
0x26  BPF_JMP32 | BPF_K | BPF_JGT   bpf_jgt32 dst, imm, +off    PC += off if dst32 > imm
0x2e  BPF_JMP32 | BPF_X | BPF_JGT   bpf_jgt32 dst, src, +off    PC += off if dst32 > src32
0x36  BPF_JMP32 | BPF_K | BPF_JGE   bpf_jge32 dst, imm, +off    PC += off if dst32 >= imm
0x3e  BPF_JMP32 | BPF_X | BPF_JGE   bpf_jge32 dst, src, +off    PC += off if dst32 >= src32
0x46  BPF_JMP32 | BPF_K | BPF_JSET  bpf_jset32 dst, imm, +off   PC += off if dst32 & imm
0x4e  BPF_JMP32 | BPF_X | BPF_JSET  bpf_jset32 dst, src, +off   PC += off if dst32 & src32
0x56  BPF_JMP32 | BPF_K | BPF_JNE   bpf_jne32 dst, imm, +off    PC += off if dst32 != imm
0x5e  BPF_JMP32 | BPF_X | BPF_JNE   bpf_jne32 dst, src, +off    PC += off if dst32 != src32
0x66  BPF_JMP32 | BPF_K | BPF_JSGT  bpf_jsgt32 dst, imm, +off   PC += off if dst32 > imm (signed)
0x6e  BPF_JMP32 | BPF_X | BPF_JSGT  bpf_jsgt32 dst, src, +off   PC += off if dst32 > src32 (signed)
0x76  BPF_JMP32 | BPF_K | BPF_JSGE  bpf_jsge32 dst, imm, +off   PC += off if dst32 >= imm (signed)
0x7e  BPF_JMP32 | BPF_X | BPF_JSGE  bpf_jsge32 dst, src, +off   PC += off if dst32 >= src32 (signed)
0xa6  BPF_JMP32 | BPF_K | BPF_JLT   bpf_jlt32 dst, imm, +off    PC += off if dst32 < imm
0xae  BPF_JMP32 | BPF_X | BPF_JLT   bpf_jlt32 dst, src, +off    PC += off if dst32 < src32
0xb6  BPF_JMP32 | BPF_K | BPF_JLE   bpf_jle32 dst, imm, +off    PC += off if dst32 <= imm
0xbe  BPF_JMP32 | BPF_X | BPF_JLE   bpf_jle32 dst, src, +off    PC += off if dst32 <= src32
0xc6  BPF_JMP32 | BPF_K | BPF_JSLT  bpf_jslt32 dst, imm, +off   PC += off if dst32 < imm (signed)
0xce  BPF_JMP32 | BPF_X | BPF_JSLT  bpf_jslt32 dst, src, +off   PC += off if dst32 < src32 (signed)
0xd6  BPF_JMP32 | BPF_K | BPF_JSLE  bpf_jsle32 dst, imm, +off   PC += off if dst32 <= imm (signed)
0xde  BPF_JMP32 | BPF_X | BPF_JSLE  bpf_jsle32 dst, src, +off   PC += off if dst32 <= src32 (signed)

BPF_LD (opc & 0x07 == 0x00)
===========================
0x18  BPF_LD | BPF_DW | BPF_IMM     bpf_lddw dst, imm64     dst = imm64

Note: the 64-bit immediate is carried across two consecutive 8-byte
instruction slots; see the sketch below.
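A sketch of that two-slot encoding, following how the kernel's
BPF_LD_IMM64 macro builds it (reusing struct bpf_insn from the sketch
near the top; hypothetical helper, details worth double checking):

	#include <stdint.h>
	/* struct bpf_insn as defined in the sketch above */

	/* sketch: emit bpf_lddw dst, imm64 into two instruction slots */
	static void emit_lddw(struct bpf_insn insn[2], uint8_t dst,
			      uint64_t imm64)
	{
		insn[0] = (struct bpf_insn) {
			.code    = 0x18,	/* BPF_LD | BPF_DW | BPF_IMM */
			.dst_reg = dst,
			.imm     = (int32_t)(uint32_t)imm64,	/* low 32 bits */
		};
		insn[1] = (struct bpf_insn) {
			.code    = 0x00,	/* reserved; marks the 2nd slot */
			.imm     = (int32_t)(imm64 >> 32),	/* high 32 bits */
		};
	}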
BPF_LDX (opc & 0x07 == 0x01)
============================
0x61  BPF_LDX | BPF_W | BPF_MEM     bpf_ldxw dst, [src + off]   dst32 = *(u32 *)(src + off)
0x69  BPF_LDX | BPF_H | BPF_MEM     bpf_ldxh dst, [src + off]   dst16 = *(u16 *)(src + off)
0x71  BPF_LDX | BPF_B | BPF_MEM     bpf_ldxb dst, [src + off]   dst8 = *(u8 *)(src + off)
0x79  BPF_LDX | BPF_DW | BPF_MEM    bpf_ldxdw dst, [src + off]  dst = *(u64 *)(src + off)

BPF_ST (opc & 0x07 == 0x02)
===========================
0x62  BPF_ST | BPF_W | BPF_IMM      bpf_stw [dst + off], imm    *(u32 *)(dst + off) = imm
0x6a  BPF_ST | BPF_H | BPF_IMM      bpf_sth [dst + off], imm    *(u16 *)(dst + off) = imm
0x72  BPF_ST | BPF_B | BPF_IMM      bpf_stb [dst + off], imm    *(u8 *)(dst + off) = imm
0x7a  BPF_ST | BPF_DW | BPF_IMM     bpf_stdw [dst + off], imm   *(u64 *)(dst + off) = imm

BPF_STX (opc & 0x07 == 0x03)
============================
0x63  BPF_STX | BPF_W | BPF_MEM     bpf_stxw [dst + off], src   *(u32 *)(dst + off) = src32
0x6b  BPF_STX | BPF_H | BPF_MEM     bpf_stxh [dst + off], src   *(u16 *)(dst + off) = src16
0x73  BPF_STX | BPF_B | BPF_MEM     bpf_stxb [dst + off], src   *(u8 *)(dst + off) = src8
0x7b  BPF_STX | BPF_DW | BPF_MEM    bpf_stxdw [dst + off], src  *(u64 *)(dst + off) = src
0xdb  BPF_STX | BPF_DW | BPF_ATOMIC     64-bit atomic instructions (see below)
0xc3  BPF_STX | BPF_W | BPF_ATOMIC      32-bit atomic instructions (see below)
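All of the load/store code bytes above are just class | size | mode
under the masks shown earlier; a throwaway helper (hypothetical, purely
for illustration) makes that visible:

	#include <stdint.h>

	/* compose a load/store opcode byte from its three fields */
	static inline uint8_t ldst_opc(uint8_t class, uint8_t size,
				       uint8_t mode)
	{
		return class | size | mode;
	}

	/* ldst_opc(0x01, 0x18, 0x60) == 0x79  (bpf_ldxdw)
	 * ldst_opc(0x03, 0x00, 0xc0) == 0xc3  (32-bit atomics) */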
Note: what should the mnemonics for the atomic instructions be? eBPF
originally had only the XADD atomic instruction, which could be written
as bpf_xadd, but with the addition of atomic AND, OR and XOR operations
that naming pattern no longer works. So for now, the tables below just
show pseudocode for each atomic operation.

64-bit atomic instructions (opc == 0xdb)
========================================
The following table applies to opc 0xdb (BPF_STX | BPF_DW | BPF_ATOMIC),
the 64-bit atomic operations. The imm value in column 1 selects the
type of atomic operation.

0x00  BPF_ADD                  lock *(u64 *)(dst + off) += src
0x01  BPF_ADD | BPF_FETCH      src = atomic_fetch_add((u64 *)(dst + off), src)
0x40  BPF_OR                   lock *(u64 *)(dst + off) |= src
0x41  BPF_OR | BPF_FETCH       src = atomic_fetch_or((u64 *)(dst + off), src)
0x50  BPF_AND                  lock *(u64 *)(dst + off) &= src
0x51  BPF_AND | BPF_FETCH      src = atomic_fetch_and((u64 *)(dst + off), src)
0xa0  BPF_XOR                  lock *(u64 *)(dst + off) ^= src
0xa1  BPF_XOR | BPF_FETCH      src = atomic_fetch_xor((u64 *)(dst + off), src)
0xe1  BPF_XCHG | BPF_FETCH     src = xchg((u64 *)(dst + off), src)
0xf1  BPF_CMPXCHG | BPF_FETCH  r0 = cmpxchg((u64 *)(dst + off), r0, src)

32-bit atomic instructions (opc == 0xc3)
========================================
The following table applies to opc 0xc3 (BPF_STX | BPF_W | BPF_ATOMIC),
the 32-bit atomic operations. The imm value in column 1 selects the
type of atomic operation.

0x00  BPF_ADD                  lock *(u32 *)(dst + off) += src32
0x01  BPF_ADD | BPF_FETCH      src32 = atomic_fetch_add((u32 *)(dst + off), src32)
0x40  BPF_OR                   lock *(u32 *)(dst + off) |= src32
0x41  BPF_OR | BPF_FETCH       src32 = atomic_fetch_or((u32 *)(dst + off), src32)
0x50  BPF_AND                  lock *(u32 *)(dst + off) &= src32
0x51  BPF_AND | BPF_FETCH      src32 = atomic_fetch_and((u32 *)(dst + off), src32)
0xa0  BPF_XOR                  lock *(u32 *)(dst + off) ^= src32
0xa1  BPF_XOR | BPF_FETCH      src32 = atomic_fetch_xor((u32 *)(dst + off), src32)
0xe1  BPF_XCHG | BPF_FETCH     src32 = xchg((u32 *)(dst + off), src32)
0xf1  BPF_CMPXCHG | BPF_FETCH  r0 = cmpxchg((u32 *)(dst + off), r0, src32)
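As a cross-check against the builtins discussion at the top of the
thread: my understanding, from the atomics patch linked above, is that
these encodings are reached from C via the __sync builtins (the fetch
variants needing clang 12+ and -mcpu=v3), so no new mnemonics are
strictly required for the C-level view. Treat the exact mapping in the
comments as needing review:

	#include <stdint.h>

	/* how the atomic encodings above are reached from C */
	void atomics_example(uint64_t *p, uint64_t v, uint64_t old)
	{
		__sync_fetch_and_add(p, v);	     /* result unused: imm 0x00, lock add */
		v = __sync_fetch_and_add(p, v);	     /* imm 0x01: BPF_ADD | BPF_FETCH */
		v = __sync_fetch_and_or(p, v);	     /* imm 0x41: BPF_OR | BPF_FETCH */
		v = __sync_lock_test_and_set(p, v);  /* imm 0xe1: xchg */
		v = __sync_val_compare_and_swap(p, old, v);  /* imm 0xf1: cmpxchg */
	}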