On Thu, 2024-03-21 at 12:46 +0000, Puranjay Mohan wrote: > With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. > To > thwart invalid memory accesses, the JITs add an exception table entry > for all such accesses. But in case the src_reg + offset overflows and > turns into a userspace address, the BPF program might read that > memory if > the user has mapped it. > > There are architectural features that prevent the kernel from > accessing > userspace memory, like Privileged Access Never (PAN) on ARM64, > Supervisor Mode Access Prevention (SMAP) on x86-64, Supervisor User > Memory access (SUM) on RISC-V, etc. But BPF should not rely on the > existence of these features. > > Make the verifier add guard instructions around such memory accesses > and > skip the load if the address falls into the userspace region. > > The JITs need to implement bpf_arch_uaddress_limit() to define where > the userspace addresses end for that architecture or TASK_SIZE is > taken > as default. > > The implementation is as follows: > > REG_AX = SRC_REG > if(offset) > REG_AX += offset; > REG_AX >>= 32; > if (REG_AX <= (uaddress_limit >> 32)) > DST_REG = 0; > else > DST_REG = *(size *)(SRC_REG + offset); > > Comparing just the upper 32 bits of the load address with the upper > 32 bits of uaddress_limit implies that the values are being aligned > down > to a 4GB boundary before comparison. > > The above means that all loads with address <= uaddress_limit + 4GB > are > skipped. This is acceptable because there is a large hole (much > larger > than 4GB) between userspace and kernel space memory, therefore a > correctly functioning BPF program should not access this 4GB memory > above the userspace. > > Let's analyze what this patch does to the following fentry program > dereferencing an untrusted pointer: > > SEC("fentry/tcp_v4_connect") > int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) > { > *(volatile long *)sk; > return 0; > } > > BPF Program before | BPF Program after > ------------------ | ----------------- > > 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0) > ------------------------------------------------------------------- > ---- > 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1 > ----------------------------\ \ 2: (77) r11 >>= 32 > 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto > pc+2 > 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0) > \ 5: (05) goto pc+1 > \ 6: (b7) r1 = 0 > \--------------------------------- > ----- > 7: (b7) r0 = 0 > 8: (95) exit > > As you can see from above, in the best case (off=0), 5 extra > instructions > are emitted. > > Now, we analyse the same program after it has gone through the JITs > of > X86-64, ARM64, and RISC-V architectures. We follow the single load > instruction that has the untrusted pointer and see what > instrumentation > has been added around it. > > x86-64 JIT > ========== > JIT's Instrumentation Verifier's > Instrumentation > (upstream) (This patch) > --------------------- ------------------------- > - > > 0: nopl 0x0(%rax,%rax,1) 0: nopl > 0x0(%rax,%rax,1) > 5: xchg %ax,%ax 5: xchg %ax,%ax > 7: push %rbp 7: push %rbp > 8: mov %rsp,%rbp 8: mov %rsp,%rbp > b: mov 0x0(%rdi),%rdi b: mov > 0x0(%rdi),%rdi > ------------------------------------------------------------------- > ----- > f: movabs $0x800000000000,%r11 f: mov %rdi,%r10 > 19: cmp %r11,%rdi 12: shr $0x20,%r10 > 1c: jb 0x000000000000002a 16: cmp $0x8000,%r10 > 1e: mov %rdi,%r11 1d: jbe > 0x0000000000000025 > 21: add $0x0,%r11 /--> 1f: mov > 0x0(%rdi),%rdi > 28: jae 0x000000000000002e / 23: jmp > 0x0000000000000027 > 2a: xor %edi,%edi / 25: xor %edi,%edi > 2c: jmp 0x0000000000000032 / /------------------------------- > ----- > 2e: mov 0x0(%rdi),%rdi ---/ / 27: xor %eax,%eax > ---------------------------------/ 29: leave > 32: xor %eax,%eax 2a: ret > 34: leave > 35: ret > > The x86-64 JIT already emits some instructions to protect against > user > memory access. The implementation in this patch leads to a smaller > number of instructions being emitted. In the worst case the JIT will > emit 9 extra instructions and this patch decreases it to 7. > > ARM64 JIT > ========= > > No Intrumentation Verifier's > Instrumentation > (upstream) (This patch) > ----------------- --------------------- > ----- > > 0: add x9, x30, #0x0 0: add x9, x30, > #0x0 > 4: nop 4: nop > 8: paciasp 8: paciasp > c: stp x29, x30, [sp, #-16]! c: stp x29, x30, > [sp, #-16]! > 10: mov x29, sp 10: mov x29, sp > 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, > [sp, #-16]! > 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, > [sp, #-16]! > 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, > [sp, #-16]! > 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, > [sp, #-16]! > 24: mov x25, sp 24: mov x25, sp > 28: mov x26, #0x0 28: mov x26, #0x0 > 2c: sub x27, x25, #0x0 2c: sub x27, x25, > #0x0 > 30: sub sp, sp, #0x0 30: sub sp, sp, > #0x0 > 34: ldr x0, [x0] 34: ldr x0, [x0] > --------------------------------------------------------------------- > ----------- > 38: ldr x0, [x0] ----------\ 38: add x9, x0, > #0x0 > -----------------------------------\\ 3c: lsr x9, x9, #32 > 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, > lsl #12 > 40: mov sp, sp \\ 44: b.ls > 0x0000000000000050 > 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] > 48: ldp x25, x26, [sp], #16 \ 4c: b > 0x0000000000000054 > 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 > 50: ldp x19, x20, [sp], #16 \--------------------------- > ------------ > 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 > 58: add x0, x7, #0x0 58: mov sp, sp > 5c: autiasp 5c: ldp x27, x28, > [sp], #16 > 60: ret 60: ldp x25, x26, > [sp], #16 > 64: nop 64: ldp x21, x22, > [sp], #16 > 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, > [sp], #16 > 6c: br x10 6c: ldp x29, x30, > [sp], #16 > 70: add x0, x7, > #0x0 > 74: autiasp > 78: ret > 7c: nop > 80: ldr x10, > 0x0000000000000088 > 84: br x10 > > There are 6 extra instructions added in ARM64 in the best case. This > will > become 7 in the worst case (off != 0). > > RISC-V JIT (RISCV_ISA_C Disabled) > ========== > > No Intrumentation Verifier's Instrumentation > (upstream) (This patch) > ----------------- -------------------------- > > 0: nop 0: nop > 4: nop 4: nop > 8: li a6, 33 8: li a6, 33 > c: addi sp, sp, -16 c: addi sp, sp, -16 > 10: sd s0, 8(sp) 10: sd s0, 8(sp) > 14: addi s0, sp, 16 14: addi s0, sp, 16 > 18: ld a0, 0(a0) 18: ld a0, 0(a0) > --------------------------------------------------------------- > 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 > --------------------------\ \ 20: srli t0, t0, 32 > 20: li a5, 0 \ \ 24: lui t1, 4096 > 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 > 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 > 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) > 30: ret \ 34: j 8 > \ 38: li a0, 0 > \------------------------------ > 3c: li a5, 0 > 40: ld s0, 8(sp) > 44: addi sp, sp, 16 > 48: sext.w a0, a5 > 4c: ret > > There are 7 extra instructions added in RISC-V. > > Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables") > Reported-by: Breno Leitao <leitao@xxxxxxxxxx> > Suggested-by: Alexei Starovoitov <ast@xxxxxxxxxx> > Signed-off-by: Puranjay Mohan <puranjay12@xxxxxxxxx> > --- > V3: > https://lore.kernel.org/bpf/20240321120842.78983-1-puranjay12@xxxxxxxxx/ > Changes in V4: > - Disable this feature on architectures that don't define > CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. > - By doing the above, we don't need anything explicitly for s390x. > > V2: > https://lore.kernel.org/bpf/20240321101058.68530-1-puranjay12@xxxxxxxxx/ > Changes in V3: > - Return 0 from bpf_arch_uaddress_limit() in disabled case because it > returns u64. > - Modify the check in verifier to no do instrumentation when > uaddress_limit > is 0. > > V1: > https://lore.kernel.org/bpf/20240320105436.4781-1-puranjay12@xxxxxxxxx/ > Changes in V2: > - Disable this feature on s390x. > --- > arch/x86/net/bpf_jit_comp.c | 72 +++++------------------------------ > -- > include/linux/filter.h | 1 + > kernel/bpf/core.c | 9 +++++ > kernel/bpf/verifier.c | 30 ++++++++++++++++ > 4 files changed, 48 insertions(+), 64 deletions(-) Acked-by: Ilya Leoshkevich <iii@xxxxxxxxxxxxx>