Ilya Leoshkevich <iii@xxxxxxxxxxxxx> writes: > On Thu, 2024-03-21 at 12:08 +0000, Puranjay Mohan wrote: >> With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. >> To >> thwart invalid memory accesses, the JITs add an exception table entry >> for all such accesses. But in case the src_reg + offset overflows and >> turns into a userspace address, the BPF program might read that >> memory if >> the user has mapped it. >> >> There are architectural features that prevent the kernel from >> accessing >> userspace memory, like Privileged Access Never (PAN) on ARM64, >> Supervisor Mode Access Prevention (SMAP) on x86-64, Supervisor User >> Memory access (SUM) on RISC-V, etc. But BPF should not rely on the >> existence of these features. >> >> Make the verifier add guard instructions around such memory accesses >> and >> skip the load if the address falls into the userspace region. >> >> The JITs need to implement bpf_arch_uaddress_limit() to define where >> the userspace addresses end for that architecture or TASK_SIZE is >> taken >> as default. >> >> The implementation is as follows: >> >> REG_AX = SRC_REG >> if(offset) >> REG_AX += offset; >> REG_AX >>= 32; >> if (REG_AX <= (uaddress_limit >> 32)) >> DST_REG = 0; >> else >> DST_REG = *(size *)(SRC_REG + offset); >> >> Comparing just the upper 32 bits of the load address with the upper >> 32 bits of uaddress_limit implies that the values are being aligned >> down >> to a 4GB boundary before comparison. >> >> The above means that all loads with address <= uaddress_limit + 4GB >> are >> skipped. This is acceptable because there is a large hole (much >> larger >> than 4GB) between userspace and kernel space memory, therefore a >> correctly functioning BPF program should not access this 4GB memory >> above the userspace. >> >> Let's analyze what this patch does to the following fentry program >> dereferencing an untrusted pointer: >> >> SEC("fentry/tcp_v4_connect") >> int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) >> { >> *(volatile long *)sk; >> return 0; >> } >> >> BPF Program before | BPF Program after >> ------------------ | ----------------- >> >> 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0) >> ------------------------------------------------------------------- >> ---- >> 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1 >> ----------------------------\ \ 2: (77) r11 >>= 32 >> 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto >> pc+2 >> 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0) >> \ 5: (05) goto pc+1 >> \ 6: (b7) r1 = 0 >> \--------------------------------- >> ----- >> 7: (b7) r0 = 0 >> 8: (95) exit >> >> As you can see from above, in the best case (off=0), 5 extra >> instructions >> are emitted. >> >> Now, we analyse the same program after it has gone through the JITs >> of >> X86-64, ARM64, and RISC-V architectures. We follow the single load >> instruction that has the untrusted pointer and see what >> instrumentation >> has been added around it. >> >> x86-64 JIT >> ========== >> JIT's Instrumentation Verifier's >> Instrumentation >> (upstream) (This patch) >> --------------------- ------------------------- >> - >> >> 0: nopl 0x0(%rax,%rax,1) 0: nopl >> 0x0(%rax,%rax,1) >> 5: xchg %ax,%ax 5: xchg %ax,%ax >> 7: push %rbp 7: push %rbp >> 8: mov %rsp,%rbp 8: mov %rsp,%rbp >> b: mov 0x0(%rdi),%rdi b: mov >> 0x0(%rdi),%rdi >> ------------------------------------------------------------------- >> ----- >> f: movabs $0x800000000000,%r11 f: mov %rdi,%r10 >> 19: cmp %r11,%rdi 12: shr $0x20,%r10 >> 1c: jb 0x000000000000002a 16: cmp $0x8000,%r10 >> 1e: mov %rdi,%r11 1d: jbe >> 0x0000000000000025 >> 21: add $0x0,%r11 /--> 1f: mov >> 0x0(%rdi),%rdi >> 28: jae 0x000000000000002e / 23: jmp >> 0x0000000000000027 >> 2a: xor %edi,%edi / 25: xor %edi,%edi >> 2c: jmp 0x0000000000000032 / /------------------------------- >> ----- >> 2e: mov 0x0(%rdi),%rdi ---/ / 27: xor %eax,%eax >> ---------------------------------/ 29: leave >> 32: xor %eax,%eax 2a: ret >> 34: leave >> 35: ret >> >> The x86-64 JIT already emits some instructions to protect against >> user >> memory access. The implementation in this patch leads to a smaller >> number of instructions being emitted. In the worst case the JIT will >> emit 9 extra instructions and this patch decreases it to 7. >> >> ARM64 JIT >> ========= >> >> No Intrumentation Verifier's >> Instrumentation >> (upstream) (This patch) >> ----------------- --------------------- >> ----- >> >> 0: add x9, x30, #0x0 0: add x9, x30, >> #0x0 >> 4: nop 4: nop >> 8: paciasp 8: paciasp >> c: stp x29, x30, [sp, #-16]! c: stp x29, x30, >> [sp, #-16]! >> 10: mov x29, sp 10: mov x29, sp >> 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, >> [sp, #-16]! >> 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, >> [sp, #-16]! >> 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, >> [sp, #-16]! >> 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, >> [sp, #-16]! >> 24: mov x25, sp 24: mov x25, sp >> 28: mov x26, #0x0 28: mov x26, #0x0 >> 2c: sub x27, x25, #0x0 2c: sub x27, x25, >> #0x0 >> 30: sub sp, sp, #0x0 30: sub sp, sp, >> #0x0 >> 34: ldr x0, [x0] 34: ldr x0, [x0] >> --------------------------------------------------------------------- >> ----------- >> 38: ldr x0, [x0] ----------\ 38: add x9, x0, >> #0x0 >> -----------------------------------\\ 3c: lsr x9, x9, #32 >> 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, >> lsl #12 >> 40: mov sp, sp \\ 44: b.ls >> 0x0000000000000050 >> 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] >> 48: ldp x25, x26, [sp], #16 \ 4c: b >> 0x0000000000000054 >> 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 >> 50: ldp x19, x20, [sp], #16 \--------------------------- >> ------------ >> 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 >> 58: add x0, x7, #0x0 58: mov sp, sp >> 5c: autiasp 5c: ldp x27, x28, >> [sp], #16 >> 60: ret 60: ldp x25, x26, >> [sp], #16 >> 64: nop 64: ldp x21, x22, >> [sp], #16 >> 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, >> [sp], #16 >> 6c: br x10 6c: ldp x29, x30, >> [sp], #16 >> 70: add x0, x7, >> #0x0 >> 74: autiasp >> 78: ret >> 7c: nop >> 80: ldr x10, >> 0x0000000000000088 >> 84: br x10 >> >> There are 6 extra instructions added in ARM64 in the best case. This >> will >> become 7 in the worst case (off != 0). >> >> RISC-V JIT (RISCV_ISA_C Disabled) >> ========== >> >> No Intrumentation Verifier's Instrumentation >> (upstream) (This patch) >> ----------------- -------------------------- >> >> 0: nop 0: nop >> 4: nop 4: nop >> 8: li a6, 33 8: li a6, 33 >> c: addi sp, sp, -16 c: addi sp, sp, -16 >> 10: sd s0, 8(sp) 10: sd s0, 8(sp) >> 14: addi s0, sp, 16 14: addi s0, sp, 16 >> 18: ld a0, 0(a0) 18: ld a0, 0(a0) >> --------------------------------------------------------------- >> 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 >> --------------------------\ \ 20: srli t0, t0, 32 >> 20: li a5, 0 \ \ 24: lui t1, 4096 >> 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 >> 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 >> 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) >> 30: ret \ 34: j 8 >> \ 38: li a0, 0 >> \------------------------------ >> 3c: li a5, 0 >> 40: ld s0, 8(sp) >> 44: addi sp, sp, 16 >> 48: sext.w a0, a5 >> 4c: ret >> >> There are 7 extra instructions added in RISC-V. >> >> Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables") >> Reported-by: Breno Leitao <leitao@xxxxxxxxxx> >> Suggested-by: Alexei Starovoitov <ast@xxxxxxxxxx> >> Signed-off-by: Puranjay Mohan <puranjay12@xxxxxxxxx> >> --- >> V2: >> https://lore.kernel.org/bpf/20240321101058.68530-1-puranjay12@xxxxxxxxx/ >> Changes in V3: >> - Return 0 from bpf_arch_uaddress_limit() in disabled case because it >> returns u64. >> - Modify the check in verifier to no do instrumentation when >> uaddress_limit >> is 0. >> >> V1: >> https://lore.kernel.org/bpf/20240320105436.4781-1-puranjay12@xxxxxxxxx/ >> Changes in V2: >> - Disable this feature on s390x. >> --- >> arch/s390/net/bpf_jit_comp.c | 5 +++ >> arch/x86/net/bpf_jit_comp.c | 72 ++++------------------------------ >> -- >> include/linux/filter.h | 1 + >> kernel/bpf/core.c | 9 +++++ >> kernel/bpf/verifier.c | 30 +++++++++++++++ >> 5 files changed, 53 insertions(+), 64 deletions(-) > > [...] > >> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c >> index 5aacb1d3c4cc..c131bee33ac3 100644 >> --- a/kernel/bpf/core.c >> +++ b/kernel/bpf/core.c >> @@ -2958,6 +2958,15 @@ bool __weak bpf_jit_supports_arena(void) >> return false; >> } >> >> +u64 __weak bpf_arch_uaddress_limit(void) >> +{ >> +#ifdef CONFIG_64BIT >> + return TASK_SIZE; >> +#else >> + return 0; >> +#endif >> +} >> + > > How about the following? > > #if defined(CONFIG_64BIT) && \ > defined(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) > > Then we won't need to do anything for s390x explicitly.` Thanks for the suggestion, I will use this in v4.