> On May 17, 2022, at 2:10 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>
> On Wed, May 11, 2022 at 07:39:34PM -0700, Alexei Starovoitov wrote:
>> Turns out that clang's -mcpu=v1,v2,v3 are not exactly correct.
>> We've extended the ISA more than three times.
>> For example when we added more atomic insns in
>> https://lore.kernel.org/bpf/20210114181751.768687-1-jackmanb@xxxxxxxxxx/
>>
>> The corresponding llvm diff didn't add a new -mcpu flavor.
>> There was no need to do it.
>
> ... because all uses of these new instructions are through builtins
> that wouldn't otherwise be available, yes.
>
>> Also llvm flags can turn a subset of insns on and off.
>> Like llvm can turn on alu32, but without the <,<= insns.
>> -mcpu=v3 includes both. -mcpu=v2 is only <,<=.
>>
>> So we need a plan B.
>
> Yes.
>
>> How about using the commit sha where support was added to the verifier
>> as a 'version' of the ISA?
>>
>> We can try to use a kernel version, but backports
>> will ruin that picture.
>> Looks like an upstream 'commit sha' is the only stable number.
>
> Using kernel release hashes is a pretty horrible interface, especially
> for non-kernel users. I also think the compilers and other tools
> would really like some vaguely meaningful identifiers.
>
>> Another approach would be to declare the current ISA as
>> 1.0 (or bpf-isa-may-2022) and
>> in the future bump it with every new insn.
>
> I think that is a much more reasonable starting position. However, are
> we sure the ISA actually evolves linearly? As far as I can tell,
> support for the full atomics only exists in a few JITs so far.
>
> So maybe starting with a baseline, and then having a name for
> each meaningful extension (e.g. the full atomics as a start) might be
> even better. For the Linux kernel case we can then also have a user
> interface where userspace programs can query which extensions are
> supported before loading eBPF programs that rely on them, instead of
> doing a roundtrip through the verifier.
>
>>> - we need to decide what to do about the legacy BPF packet access
>>>   instructions. Alexei mentioned that the modern JIT doesn't
>>>   even use those internally any more.
>>
>> I think we need to document them as supported in the Linux kernel,
>> but deprecated in general.
>> The standard might say "implementation defined", meaning that
>> different run-times don't have to support them.
>
> Yeah. If we do the extensions proposal above we could make these
> a specific extension as well.
>
>> [...]
>> I don't think it's worth documenting all that.
>> I would group all undefined/underflow/overflow behavior as
>> implementation defined and document only the things that matter.
>
> Makes sense.
>
>>> Discussion on where to host a definitive version of the document:
>>>
>>> - I think the rough consensus is to just host regular (hopefully
>>>   low cadence) documents and maybe the latest greatest at an eBPF
>>>   Foundation website. Whom do we need to work with at the foundation
>>>   to make this happen?
>>
>> foundation folks cc-ed.
>
> I'd be really glad if we could kick this off ASAP. Feel free to contact
> me privately if we want to keep it off the list.
>
>>> - as an idea it was brought up to write a document with the minimal
>>>   verification requirements required for any eBPF implementation
>>>   independent of the program type. Again I can volunteer to
>>>   draft such a document, but I need input on what the consensus
>>>   would be. In this case input from the non-Linux verifier
>>>   implementors (I only know the Microsoft research one) would
>>>   be very helpful as well.
>>
>> The verifier is a moving target.
>
> Absolutely.
>
>> I'd say minimal verification is the one that checks that:
>> - instructions are formed correctly
>> - opcode is valid
>> - no reserved bits are used
>> - registers are within range (r11+ are not used)
>> - combination of opcode+regs+off+imm is valid
>> - simple things like that
>
> Sounds good. One useful thing for this would be an opcode table
> with all the optional field usage in machine readable format.
>
> Jim who is on CC has already built a nice table of all opcodes based
> on existing material that might be a good starting point.
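To make Alexei's checklist concrete, here is a rough sketch of those
structural checks. This is hypothetical code, not the kernel verifier:
the struct layout is the one from include/uapi/linux/bpf.h, and a real
opcode_valid() would be generated from a machine readable form of the
tables that follow, so treat the details as needing review.

	/* Hypothetical sketch of minimal structural checks; not
	 * verifier code. Instruction layout per include/uapi/linux/bpf.h. */
	#include <stdbool.h>
	#include <stdint.h>

	struct bpf_insn {
		uint8_t  code;		/* opcode */
		uint8_t  dst_reg:4;	/* destination register */
		uint8_t  src_reg:4;	/* source register */
		int16_t  off;		/* signed offset */
		int32_t  imm;		/* signed immediate constant */
	};

	#define MAX_BPF_REG	10	/* r0..r10; r11+ do not exist */

	static bool opcode_valid(uint8_t code)
	{
		/* placeholder: a real check would index a 256-entry
		 * table generated from the opcode list below. 0x00 is
		 * reserved; it only appears as the second slot of
		 * bpf_lddw, which needs special casing. */
		return code != 0x00;
	}

	static bool insn_valid(const struct bpf_insn *insn)
	{
		if (!opcode_valid(insn->code))
			return false;
		if (insn->dst_reg > MAX_BPF_REG ||
		    insn->src_reg > MAX_BPF_REG)
			return false;
		/* per-opcode reserved fields must be zero, e.g.
		 * bpf_exit (0x95) takes no regs, offset or immediate */
		if (insn->code == 0x95 &&
		    (insn->dst_reg || insn->src_reg || insn->off || insn->imm))
			return false;
		return true;
	}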
Table is inline below. I used the tables in the iovisor project
documentation as a starting point:
https://github.com/iovisor/bpf-docs/blob/master/eBPF.md. Feedback
welcome. The atomic sections at the bottom could especially use some
careful review for correctness.
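For reference, the 'opc & 0x07' class tests in the section headings
below, and the size/mode and source/operation splits within each class,
follow the standard opcode masks; these are the definitions from the
kernel's include/uapi/linux/bpf_common.h and linux/bpf.h (values as I
read them, so worth double checking):

	#define BPF_CLASS(code) ((code) & 0x07)	/* LD, LDX, ST, STX, ALU, JMP, JMP32, ALU64 */

	/* load/store classes split the remaining bits as size + mode */
	#define BPF_SIZE(code)	((code) & 0x18)	/* W=0x00, H=0x08, B=0x10, DW=0x18 */
	#define BPF_MODE(code)	((code) & 0xe0)	/* IMM=0x00, MEM=0x60, ATOMIC=0xc0, ... */

	/* ALU/JMP classes split them as source + operation */
	#define BPF_SRC(code)	((code) & 0x08)	/* K=0x00 (use imm), X=0x08 (use src reg) */
	#define BPF_OP(code)	((code) & 0xf0)	/* ADD=0x00, SUB=0x10, ..., END=0xd0 */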
BPF_ALU64 (opc & 0x07 == 0x07)
==============================
0x07  BPF_ALU64 | BPF_K | BPF_ADD   bpf_add dst, imm        dst += imm
0x0f  BPF_ALU64 | BPF_X | BPF_ADD   bpf_add dst, src        dst += src
0x17  BPF_ALU64 | BPF_K | BPF_SUB   bpf_sub dst, imm        dst -= imm
0x1f  BPF_ALU64 | BPF_X | BPF_SUB   bpf_sub dst, src        dst -= src
0x27  BPF_ALU64 | BPF_K | BPF_MUL   bpf_mul dst, imm        dst *= imm
0x2f  BPF_ALU64 | BPF_X | BPF_MUL   bpf_mul dst, src        dst *= src
0x37  BPF_ALU64 | BPF_K | BPF_DIV   bpf_div dst, imm        dst /= imm
0x3f  BPF_ALU64 | BPF_X | BPF_DIV   bpf_div dst, src        dst /= src
0x47  BPF_ALU64 | BPF_K | BPF_OR    bpf_or dst, imm         dst |= imm
0x4f  BPF_ALU64 | BPF_X | BPF_OR    bpf_or dst, src         dst |= src
0x57  BPF_ALU64 | BPF_K | BPF_AND   bpf_and dst, imm        dst &= imm
0x5f  BPF_ALU64 | BPF_X | BPF_AND   bpf_and dst, src        dst &= src
0x67  BPF_ALU64 | BPF_K | BPF_LSH   bpf_lsh dst, imm        dst <<= imm
0x6f  BPF_ALU64 | BPF_X | BPF_LSH   bpf_lsh dst, src        dst <<= src
0x77  BPF_ALU64 | BPF_K | BPF_RSH   bpf_rsh dst, imm        dst >>= imm (logical)
0x7f  BPF_ALU64 | BPF_X | BPF_RSH   bpf_rsh dst, src        dst >>= src (logical)
0x87  BPF_ALU64 | BPF_K | BPF_NEG   bpf_neg dst             dst = -dst
0x97  BPF_ALU64 | BPF_K | BPF_MOD   bpf_mod dst, imm        dst %= imm
0x9f  BPF_ALU64 | BPF_X | BPF_MOD   bpf_mod dst, src        dst %= src
0xa7  BPF_ALU64 | BPF_K | BPF_XOR   bpf_xor dst, imm        dst ^= imm
0xaf  BPF_ALU64 | BPF_X | BPF_XOR   bpf_xor dst, src        dst ^= src
0xb7  BPF_ALU64 | BPF_K | BPF_MOV   bpf_mov dst, imm        dst = imm
0xbf  BPF_ALU64 | BPF_X | BPF_MOV   bpf_mov dst, src        dst = src
0xc7  BPF_ALU64 | BPF_K | BPF_ARSH  bpf_arsh dst, imm       dst >>= imm (arithmetic)
0xcf  BPF_ALU64 | BPF_X | BPF_ARSH  bpf_arsh dst, src       dst >>= src (arithmetic)

BPF_ALU (opc & 0x07 == 0x04)
============================
0x04  BPF_ALU | BPF_K | BPF_ADD     bpf_add32 dst, imm      dst32 += imm
0x0c  BPF_ALU | BPF_X | BPF_ADD     bpf_add32 dst, src      dst32 += src32
0x14  BPF_ALU | BPF_K | BPF_SUB     bpf_sub32 dst, imm      dst32 -= imm
0x1c  BPF_ALU | BPF_X | BPF_SUB     bpf_sub32 dst, src      dst32 -= src32
0x24  BPF_ALU | BPF_K | BPF_MUL     bpf_mul32 dst, imm      dst32 *= imm
0x2c  BPF_ALU | BPF_X | BPF_MUL     bpf_mul32 dst, src      dst32 *= src32
0x34  BPF_ALU | BPF_K | BPF_DIV     bpf_div32 dst, imm      dst32 /= imm
0x3c  BPF_ALU | BPF_X | BPF_DIV     bpf_div32 dst, src      dst32 /= src32
0x44  BPF_ALU | BPF_K | BPF_OR      bpf_or32 dst, imm       dst32 |= imm
0x4c  BPF_ALU | BPF_X | BPF_OR      bpf_or32 dst, src       dst32 |= src32
0x54  BPF_ALU | BPF_K | BPF_AND     bpf_and32 dst, imm      dst32 &= imm
0x5c  BPF_ALU | BPF_X | BPF_AND     bpf_and32 dst, src      dst32 &= src32
0x64  BPF_ALU | BPF_K | BPF_LSH     bpf_lsh32 dst, imm      dst32 <<= imm
0x6c  BPF_ALU | BPF_X | BPF_LSH     bpf_lsh32 dst, src      dst32 <<= src32
0x74  BPF_ALU | BPF_K | BPF_RSH     bpf_rsh32 dst, imm      dst32 >>= imm (logical)
0x7c  BPF_ALU | BPF_X | BPF_RSH     bpf_rsh32 dst, src      dst32 >>= src32 (logical)
0x84  BPF_ALU | BPF_K | BPF_NEG     bpf_neg32 dst           dst32 = -dst32
0x94  BPF_ALU | BPF_K | BPF_MOD     bpf_mod32 dst, imm      dst32 %= imm
0x9c  BPF_ALU | BPF_X | BPF_MOD     bpf_mod32 dst, src      dst32 %= src32
0xa4  BPF_ALU | BPF_K | BPF_XOR     bpf_xor32 dst, imm      dst32 ^= imm
0xac  BPF_ALU | BPF_X | BPF_XOR     bpf_xor32 dst, src      dst32 ^= src32
0xb4  BPF_ALU | BPF_K | BPF_MOV     bpf_mov32 dst, imm      dst32 = imm
0xbc  BPF_ALU | BPF_X | BPF_MOV     bpf_mov32 dst, src      dst32 = src32
0xc4  BPF_ALU | BPF_K | BPF_ARSH    bpf_arsh32 dst, imm     dst32 >>= imm (arithmetic)
0xcc  BPF_ALU | BPF_X | BPF_ARSH    bpf_arsh32 dst, src     dst32 >>= src32 (arithmetic)

For BPF_END instructions, BPF_K selects a htole conversion and BPF_X a
htobe conversion. The operation size (16, 32 or 64 bit) is determined
by the 'imm' field of the instruction (the upper 32 bits of the
8-byte instruction).

0xd4 (.imm = 16)  BPF_ALU | BPF_K | BPF_END   bpf_le16 dst   dst16 = htole16(dst16)
0xd4 (.imm = 32)  BPF_ALU | BPF_K | BPF_END   bpf_le32 dst   dst32 = htole32(dst32)
0xd4 (.imm = 64)  BPF_ALU | BPF_K | BPF_END   bpf_le64 dst   dst = htole64(dst)
0xdc (.imm = 16)  BPF_ALU | BPF_X | BPF_END   bpf_be16 dst   dst16 = htobe16(dst16)
0xdc (.imm = 32)  BPF_ALU | BPF_X | BPF_END   bpf_be32 dst   dst32 = htobe32(dst32)
0xdc (.imm = 64)  BPF_ALU | BPF_X | BPF_END   bpf_be64 dst   dst = htobe64(dst)

BPF_JMP (opc & 0x07 == 0x05)
============================
0x05  BPF_JMP | BPF_K | BPF_JA      bpf_ja +off             PC += off
0x15  BPF_JMP | BPF_K | BPF_JEQ     bpf_jeq dst, imm, +off  PC += off if dst == imm
0x1d  BPF_JMP | BPF_X | BPF_JEQ     bpf_jeq dst, src, +off  PC += off if dst == src
0x25  BPF_JMP | BPF_K | BPF_JGT     bpf_jgt dst, imm, +off  PC += off if dst > imm
0x2d  BPF_JMP | BPF_X | BPF_JGT     bpf_jgt dst, src, +off  PC += off if dst > src
0x35  BPF_JMP | BPF_K | BPF_JGE     bpf_jge dst, imm, +off  PC += off if dst >= imm
0x3d  BPF_JMP | BPF_X | BPF_JGE     bpf_jge dst, src, +off  PC += off if dst >= src
0x45  BPF_JMP | BPF_K | BPF_JSET    bpf_jset dst, imm, +off PC += off if dst & imm
0x4d  BPF_JMP | BPF_X | BPF_JSET    bpf_jset dst, src, +off PC += off if dst & src
0x55  BPF_JMP | BPF_K | BPF_JNE     bpf_jne dst, imm, +off  PC += off if dst != imm
0x5d  BPF_JMP | BPF_X | BPF_JNE     bpf_jne dst, src, +off  PC += off if dst != src
0x65  BPF_JMP | BPF_K | BPF_JSGT    bpf_jsgt dst, imm, +off PC += off if dst > imm (signed)
0x6d  BPF_JMP | BPF_X | BPF_JSGT    bpf_jsgt dst, src, +off PC += off if dst > src (signed)
0x75  BPF_JMP | BPF_K | BPF_JSGE    bpf_jsge dst, imm, +off PC += off if dst >= imm (signed)
0x7d  BPF_JMP | BPF_X | BPF_JSGE    bpf_jsge dst, src, +off PC += off if dst >= src (signed)
0x85  BPF_JMP | BPF_K | BPF_CALL    bpf_call imm            Function call
0x95  BPF_JMP | BPF_K | BPF_EXIT    bpf_exit                return r0
0xa5  BPF_JMP | BPF_K | BPF_JLT     bpf_jlt dst, imm, +off  PC += off if dst < imm
0xad  BPF_JMP | BPF_X | BPF_JLT     bpf_jlt dst, src, +off  PC += off if dst < src
0xb5  BPF_JMP | BPF_K | BPF_JLE     bpf_jle dst, imm, +off  PC += off if dst <= imm
0xbd  BPF_JMP | BPF_X | BPF_JLE     bpf_jle dst, src, +off  PC += off if dst <= src
0xc5  BPF_JMP | BPF_K | BPF_JSLT    bpf_jslt dst, imm, +off PC += off if dst < imm (signed)
0xcd  BPF_JMP | BPF_X | BPF_JSLT    bpf_jslt dst, src, +off PC += off if dst < src (signed)
0xd5  BPF_JMP | BPF_K | BPF_JSLE    bpf_jsle dst, imm, +off PC += off if dst <= imm (signed)
0xdd  BPF_JMP | BPF_X | BPF_JSLE    bpf_jsle dst, src, +off PC += off if dst <= src (signed)

BPF_JMP32 (opc & 0x07 == 0x06)
==============================
0x16  BPF_JMP32 | BPF_K | BPF_JEQ   bpf_jeq32 dst, imm, +off    PC += off if dst32 == imm
0x1e  BPF_JMP32 | BPF_X | BPF_JEQ   bpf_jeq32 dst, src, +off    PC += off if dst32 == src32
0x26  BPF_JMP32 | BPF_K | BPF_JGT   bpf_jgt32 dst, imm, +off    PC += off if dst32 > imm
0x2e  BPF_JMP32 | BPF_X | BPF_JGT   bpf_jgt32 dst, src, +off    PC += off if dst32 > src32
0x36  BPF_JMP32 | BPF_K | BPF_JGE   bpf_jge32 dst, imm, +off    PC += off if dst32 >= imm
0x3e  BPF_JMP32 | BPF_X | BPF_JGE   bpf_jge32 dst, src, +off    PC += off if dst32 >= src32
0x46  BPF_JMP32 | BPF_K | BPF_JSET  bpf_jset32 dst, imm, +off   PC += off if dst32 & imm
0x4e  BPF_JMP32 | BPF_X | BPF_JSET  bpf_jset32 dst, src, +off   PC += off if dst32 & src32
0x56  BPF_JMP32 | BPF_K | BPF_JNE   bpf_jne32 dst, imm, +off    PC += off if dst32 != imm
0x5e  BPF_JMP32 | BPF_X | BPF_JNE   bpf_jne32 dst, src, +off    PC += off if dst32 != src32
0x66  BPF_JMP32 | BPF_K | BPF_JSGT  bpf_jsgt32 dst, imm, +off   PC += off if dst32 > imm (signed)
0x6e  BPF_JMP32 | BPF_X | BPF_JSGT  bpf_jsgt32 dst, src, +off   PC += off if dst32 > src32 (signed)
0x76  BPF_JMP32 | BPF_K | BPF_JSGE  bpf_jsge32 dst, imm, +off   PC += off if dst32 >= imm (signed)
0x7e  BPF_JMP32 | BPF_X | BPF_JSGE  bpf_jsge32 dst, src, +off   PC += off if dst32 >= src32 (signed)
0xa6  BPF_JMP32 | BPF_K | BPF_JLT   bpf_jlt32 dst, imm, +off    PC += off if dst32 < imm
0xae  BPF_JMP32 | BPF_X | BPF_JLT   bpf_jlt32 dst, src, +off    PC += off if dst32 < src32
0xb6  BPF_JMP32 | BPF_K | BPF_JLE   bpf_jle32 dst, imm, +off    PC += off if dst32 <= imm
0xbe  BPF_JMP32 | BPF_X | BPF_JLE   bpf_jle32 dst, src, +off    PC += off if dst32 <= src32
0xc6  BPF_JMP32 | BPF_K | BPF_JSLT  bpf_jslt32 dst, imm, +off   PC += off if dst32 < imm (signed)
0xce  BPF_JMP32 | BPF_X | BPF_JSLT  bpf_jslt32 dst, src, +off   PC += off if dst32 < src32 (signed)
0xd6  BPF_JMP32 | BPF_K | BPF_JSLE  bpf_jsle32 dst, imm, +off   PC += off if dst32 <= imm (signed)
0xde  BPF_JMP32 | BPF_X | BPF_JSLE  bpf_jsle32 dst, src, +off   PC += off if dst32 <= src32 (signed)

BPF_LD (opc & 0x07 == 0x00)
===========================
0x18  BPF_LD | BPF_DW | BPF_IMM     bpf_lddw dst, imm64     dst = imm64

Note: the 64-bit immediate is carried across two consecutive 8-byte
instruction slots; see the sketch below.
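A sketch of that two-slot encoding, following how the kernel's
BPF_LD_IMM64 macro builds it (reusing struct bpf_insn from the sketch
near the top; hypothetical helper, details worth double checking):

	#include <stdint.h>
	/* struct bpf_insn as defined in the sketch above */

	/* sketch: emit bpf_lddw dst, imm64 into two instruction slots */
	static void emit_lddw(struct bpf_insn insn[2], uint8_t dst,
			      uint64_t imm64)
	{
		insn[0] = (struct bpf_insn) {
			.code    = 0x18,	/* BPF_LD | BPF_DW | BPF_IMM */
			.dst_reg = dst,
			.imm     = (int32_t)(uint32_t)imm64,	/* low 32 bits */
		};
		insn[1] = (struct bpf_insn) {
			.code    = 0x00,	/* reserved; marks the 2nd slot */
			.imm     = (int32_t)(imm64 >> 32),	/* high 32 bits */
		};
	}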
BPF_LDX (opc & 0x07 == 0x01)
============================
0x61  BPF_LDX | BPF_W | BPF_MEM     bpf_ldxw dst, [src + off]   dst32 = *(u32 *)(src + off)
0x69  BPF_LDX | BPF_H | BPF_MEM     bpf_ldxh dst, [src + off]   dst16 = *(u16 *)(src + off)
0x71  BPF_LDX | BPF_B | BPF_MEM     bpf_ldxb dst, [src + off]   dst8 = *(u8 *)(src + off)
0x79  BPF_LDX | BPF_DW | BPF_MEM    bpf_ldxdw dst, [src + off]  dst = *(u64 *)(src + off)

BPF_ST (opc & 0x07 == 0x02)
===========================
0x62  BPF_ST | BPF_W | BPF_IMM      bpf_stw [dst + off], imm    *(u32 *)(dst + off) = imm
0x6a  BPF_ST | BPF_H | BPF_IMM      bpf_sth [dst + off], imm    *(u16 *)(dst + off) = imm
0x72  BPF_ST | BPF_B | BPF_IMM      bpf_stb [dst + off], imm    *(u8 *)(dst + off) = imm
0x7a  BPF_ST | BPF_DW | BPF_IMM     bpf_stdw [dst + off], imm   *(u64 *)(dst + off) = imm

BPF_STX (opc & 0x07 == 0x03)
============================
0x63  BPF_STX | BPF_W | BPF_MEM     bpf_stxw [dst + off], src   *(u32 *)(dst + off) = src32
0x6b  BPF_STX | BPF_H | BPF_MEM     bpf_stxh [dst + off], src   *(u16 *)(dst + off) = src16
0x73  BPF_STX | BPF_B | BPF_MEM     bpf_stxb [dst + off], src   *(u8 *)(dst + off) = src8
0x7b  BPF_STX | BPF_DW | BPF_MEM    bpf_stxdw [dst + off], src  *(u64 *)(dst + off) = src
0xdb  BPF_STX | BPF_DW | BPF_ATOMIC     64-bit atomic instructions (see below)
0xc3  BPF_STX | BPF_W | BPF_ATOMIC      32-bit atomic instructions (see below)
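All of the load/store code bytes above are just class | size | mode
under the masks shown earlier; a throwaway helper (hypothetical, purely
for illustration) makes that visible:

	#include <stdint.h>

	/* compose a load/store opcode byte from its three fields */
	static inline uint8_t ldst_opc(uint8_t class, uint8_t size,
				       uint8_t mode)
	{
		return class | size | mode;
	}

	/* ldst_opc(0x01, 0x18, 0x60) == 0x79  (bpf_ldxdw)
	 * ldst_opc(0x03, 0x00, 0xc0) == 0xc3  (32-bit atomics) */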
Note: what should the mnemonics for the atomic instructions be? eBPF
originally had only the XADD atomic instruction, which could be written
as bpf_xadd, but with the addition of atomic AND, OR and XOR operations
that naming pattern no longer works. So for now, the tables below just
show pseudocode for each atomic operation.

64-bit atomic instructions (opc == 0xdb)
========================================
The following table applies to opc 0xdb (BPF_STX | BPF_DW | BPF_ATOMIC),
the 64-bit atomic operations. The imm value in column 1 selects the
type of atomic operation.

0x00  BPF_ADD                  lock *(u64 *)(dst + off) += src
0x01  BPF_ADD | BPF_FETCH      src = atomic_fetch_add((u64 *)(dst + off), src)
0x40  BPF_OR                   lock *(u64 *)(dst + off) |= src
0x41  BPF_OR | BPF_FETCH       src = atomic_fetch_or((u64 *)(dst + off), src)
0x50  BPF_AND                  lock *(u64 *)(dst + off) &= src
0x51  BPF_AND | BPF_FETCH      src = atomic_fetch_and((u64 *)(dst + off), src)
0xa0  BPF_XOR                  lock *(u64 *)(dst + off) ^= src
0xa1  BPF_XOR | BPF_FETCH      src = atomic_fetch_xor((u64 *)(dst + off), src)
0xe1  BPF_XCHG | BPF_FETCH     src = xchg((u64 *)(dst + off), src)
0xf1  BPF_CMPXCHG | BPF_FETCH  r0 = cmpxchg((u64 *)(dst + off), r0, src)

32-bit atomic instructions (opc == 0xc3)
========================================
The following table applies to opc 0xc3 (BPF_STX | BPF_W | BPF_ATOMIC),
the 32-bit atomic operations. The imm value in column 1 selects the
type of atomic operation.

0x00  BPF_ADD                  lock *(u32 *)(dst + off) += src32
0x01  BPF_ADD | BPF_FETCH      src32 = atomic_fetch_add((u32 *)(dst + off), src32)
0x40  BPF_OR                   lock *(u32 *)(dst + off) |= src32
0x41  BPF_OR | BPF_FETCH       src32 = atomic_fetch_or((u32 *)(dst + off), src32)
0x50  BPF_AND                  lock *(u32 *)(dst + off) &= src32
0x51  BPF_AND | BPF_FETCH      src32 = atomic_fetch_and((u32 *)(dst + off), src32)
0xa0  BPF_XOR                  lock *(u32 *)(dst + off) ^= src32
0xa1  BPF_XOR | BPF_FETCH      src32 = atomic_fetch_xor((u32 *)(dst + off), src32)
0xe1  BPF_XCHG | BPF_FETCH     src32 = xchg((u32 *)(dst + off), src32)
0xf1  BPF_CMPXCHG | BPF_FETCH  r0 = cmpxchg((u32 *)(dst + off), r0, src32)
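As a cross-check against the builtins discussion at the top of the
thread: my understanding, from the atomics patch linked above, is that
these encodings are reached from C via the __sync builtins (the fetch
variants needing clang 12+ and -mcpu=v3), so no new mnemonics are
strictly required for the C-level view. Treat the exact mapping in the
comments as needing review:

	#include <stdint.h>

	/* how the atomic encodings above are reached from C */
	void atomics_example(uint64_t *p, uint64_t v, uint64_t old)
	{
		__sync_fetch_and_add(p, v);	     /* result unused: imm 0x00, lock add */
		v = __sync_fetch_and_add(p, v);	     /* imm 0x01: BPF_ADD | BPF_FETCH */
		v = __sync_fetch_and_or(p, v);	     /* imm 0x41: BPF_OR | BPF_FETCH */
		v = __sync_lock_test_and_set(p, v);  /* imm 0xe1: xchg */
		v = __sync_val_compare_and_swap(p, old, v);  /* imm 0xf1: cmpxchg */
	}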