Daniel Borkmann <daniel@xxxxxxxxxxxxx> writes:

> On 3/22/24 4:05 PM, Puranjay Mohan wrote:
> [...]
>>>> +		/* Make it impossible to de-reference a userspace address */
>>>> +		if (BPF_CLASS(insn->code) == BPF_LDX &&
>>>> +		    (BPF_MODE(insn->code) == BPF_PROBE_MEM ||
>>>> +		     BPF_MODE(insn->code) == BPF_PROBE_MEMSX)) {
>>>> +			struct bpf_insn *patch = &insn_buf[0];
>>>> +			u64 uaddress_limit = bpf_arch_uaddress_limit();
>>>> +
>>>> +			if (!uaddress_limit)
>>>> +				goto next_insn;
>>>> +
>>>> +			*patch++ = BPF_MOV64_REG(BPF_REG_AX, insn->src_reg);
>>>> +			if (insn->off)
>>>> +				*patch++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_AX, insn->off);
>>>> +			*patch++ = BPF_ALU64_IMM(BPF_RSH, BPF_REG_AX, 32);
>>>> +			*patch++ = BPF_JMP_IMM(BPF_JLE, BPF_REG_AX, uaddress_limit >> 32, 2);
>>>> +			*patch++ = *insn;
>>>> +			*patch++ = BPF_JMP_IMM(BPF_JA, 0, 0, 1);
>>>> +			*patch++ = BPF_MOV64_IMM(insn->dst_reg, 0);
>>>
>>> But how does this address other cases where we could fault, e.g.
>>> non-canonical, vsyscall page, etc? Technically, we would have to call
>>> copy_from_kernel_nofault_allowed() to really address all the cases
>>> aside from the overflow (good catch btw!) where a kernel address
>>> turns into a user address.
>>
>> So, we are trying to ~simulate a call to
>> copy_from_kernel_nofault_allowed() here. If the address under
>> consideration is below TASK_SIZE (TASK_SIZE + 4GB to be precise) then
>> we skip that load because that address could be mapped by the user.
>>
>> If the address is above TASK_SIZE + 4GB, we allow the load and it
>> could cause a fault if the address is invalid, non-canonical, etc.
>> Taking the fault is fine because the JIT will add an exception table
>> entry for that load with BPF_PROBE_MEM.
>
> Are you sure? I don't think the kernel handles non-canonical fixup.

At least for ARM64, I don't see any differentiation between the handling
of canonical and non-canonical addresses. do_translation_fault() checks
if addr < TASK_SIZE and calls do_page_fault(); if the address is greater
than TASK_SIZE (i.e. it is a kernel address), do_bad_area() is called
instead. Both of these call __do_kernel_fault() if the fault is from
kernel mode, and that does fixup_exception().

>> The vsyscall page is special, this approach skips all loads from this
>> page. I am not sure if that is acceptable.
>
> The bpf_probe_read_kernel() does handle it fine via copy_from_kernel_nofault().

bpf_probe_read_kernel() skips reading from the vsyscall page, which is
what this patch does as well.

ARM64, RISC-V, and some other archs don't implement
copy_from_kernel_nofault_allowed(), so I think we should fix the common
case that holds for all architectures: a BPF program must not be allowed
to access memory below TASK_SIZE.

> So there is tail risk that BPF_PROBE_* could trigger a crash. Other
> archs might have other quirks, e.g. in case of loongarch it says
> highest bit set means kernel space.

Can you explain this a bit more, how will BPF_PROBE_* trigger a crash?
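To make the check concrete, here is a standalone C sketch of the guard
that the rewritten instruction sequence implements. It is illustrative
only, not the verifier code itself: UADDRESS_LIMIT and probe_load() are
made-up names, and the real limit comes from bpf_arch_uaddress_limit().

  #include <stdint.h>
  #include <stdio.h>

  /* Illustrative value standing in for bpf_arch_uaddress_limit(). */
  #define UADDRESS_LIMIT 0x0000800000000000ULL

  static uint64_t probe_load(uint64_t addr)
  {
  	/* BPF_RSH by 32 + BPF_JLE against (uaddress_limit >> 32) */
  	if ((addr >> 32) <= (UADDRESS_LIMIT >> 32))
  		return 0;	/* possible user address: dst_reg = 0 */

  	/*
  	 * Kernel address: the JIT would emit the real load here, with
  	 * an exception table entry so a fault gets fixed up instead of
  	 * crashing the kernel.
  	 */
  	return 1;		/* pretend-loaded value */
  }

  int main(void)
  {
  	/* user address: load skipped, returns 0 */
  	printf("%llu\n", (unsigned long long)probe_load(0x1000));
  	/* kernel address: load performed */
  	printf("%llu\n", (unsigned long long)probe_load(0xffff000000001000ULL));
  	return 0;
  }

Because only the upper 32 bits are compared, everything up to the limit
rounded up to the next 4GB boundary is skipped, which is where the
"TASK_SIZE + 4GB to be precise" above comes from.

Thanks,
Puranjay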