Hi All, one more RFC...

The major difference vs the previous set is a new 'load 64-bit immediate' eBPF instruction. It is the first 16-byte instruction. It shows how the eBPF ISA can be extended while maintaining backward compatibility, but mainly it cleans up eBPF program access to maps and improves run-time performance.

In V3 I was using a 'fixup' section in the eBPF program to tell the kernel which instructions access maps. With the new instruction the 'fixup' section is gone and the map IDR (internal map_ids) is removed.

To understand the logic behind the new insn, I need to explain two main eBPF design constraints:

1. The eBPF interpreter must be generic. It should know nothing about maps or any custom instructions or functions.

2. The llvm compiler backend must be generic. It also should know nothing about maps, helper functions, sockets, tracing, etc. LLVM just takes normal C and compiles it for some 'fake' HW that happens to be called the eBPF ISA.

Patch #1 implements the BPF_LD_IMM64 insn. It's just a move of a 64-bit immediate value into a register. Nothing fancy. The reason it improves eBPF program run-time is the following: in V3 the program used to look like:

  bpf_mov r1, const_internal_map_id
  bpf_call bpf_map_lookup

so the in-kernel bpf_map_lookup() helper would do the map_id->map_ptr conversion via:

  map = idr_find(&bpf_map_id_idr, map_id);

For the life of the program map_id is constant and that lookup was returning the same value, but there was no easy way to store a pointer inside an eBPF insn. With the new insn the programs look like:

  bpf_ld_imm64 r1, const_internal_map_ptr
  bpf_call bpf_map_lookup

and the bpf_map_lookup() helper does:

  struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;

Though it's a small performance gain, every nsec counts. The new insn also allows further optimizations in JIT compilers.

How does it help to clean up the program interface towards maps? Obviously user space doesn't know what kernel map pointer is associated with a process-local map-FD. So it uses a pseudo BPF_LD_IMM64 instruction:

  BPF_LD_IMM64 with src_reg == 0                 -> generic move of 64-bit immediate into dst_reg
  BPF_LD_IMM64 with src_reg == BPF_PSEUDO_MAP_FD -> move map_fd into dst_reg

Other values are reserved for now. (They will be used to implement global variables, strings and other constants and per-cpu areas in the future.) So the programs look like:

  BPF_LD_MAP_FD(BPF_REG_1, process_local_map_fd),
  BPF_CALL(BPF_FUNC_map_lookup_elem),

The eBPF verifier scans the program for such pseudo instructions, converts process_local_map_fd -> in-kernel map pointer and drops the 'pseudo' flag of the BPF_LD_IMM64 instruction (a standalone sketch of the encoding and this rewrite follows below). The eBPF interpreter stays generic and LLVM stays generic, since they know nothing about pseudo instructions.

Another pseudo instruction is BPF_CALL. User space encodes one of the BPF_FUNC_xxx function ids in the 'imm' field of the instruction and the eBPF program loader converts it to an in-kernel helper function pointer.

The idea to use special instructions to access maps was suggested by Jonathan ;) It took a while to figure out how to do it within the above two design constraints, but the end result I think is much cleaner than what I had in V2/V3.

Another difference vs the previous set is that the verifier is split into 6 patches and a verifier testsuite is added. Beyond the old checks the verifier got 'tidiness' checks to make sure all unused fields of instructions are zero. Unfortunately classic BPF doesn't check for this. Lesson learned.
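To make the 16-byte encoding and the fd -> map pointer rewrite concrete, below is a rough user-space sketch, not the actual kernel or verifier code from the patches. It encodes a BPF_LD_IMM64 as two 8-byte insn slots (low 32 bits of the immediate in insn[0].imm, high 32 bits in insn[1].imm) and then patches the BPF_PSEUDO_MAP_FD variant the way the verifier is described to do it above. The fd_to_map[] table, the BPF_PSEUDO_MAP_FD value and the helper names here are illustrative stand-ins, not the real in-kernel interfaces.

	/*
	 * Sketch only: BPF_LD_IMM64 encoded as two 8-byte insn slots, plus a
	 * verifier-style pass that replaces a process-local map fd with a
	 * 64-bit map pointer and drops the pseudo flag.  fd_to_map[] stands
	 * in for whatever fd -> bpf_map lookup the kernel actually does.
	 */
	#include <stdint.h>
	#include <stdio.h>

	struct bpf_insn {
		uint8_t  code;		/* opcode */
		uint8_t  dst_reg:4;	/* destination register */
		uint8_t  src_reg:4;	/* source register */
		int16_t  off;		/* signed offset */
		int32_t  imm;		/* signed immediate */
	};

	#define BPF_LD			0x00
	#define BPF_DW			0x18
	#define BPF_IMM			0x00
	#define BPF_LD_IMM64_CODE	(BPF_LD | BPF_DW | BPF_IMM)
	#define BPF_PSEUDO_MAP_FD	1	/* value assumed for this sketch */

	/* low 32 bits go into insn[0].imm, high 32 bits into insn[1].imm;
	 * insn[1].code is 0 and is never executed on its own */
	static void emit_ld_imm64(struct bpf_insn insn[2], uint8_t dst,
				  uint8_t src, uint64_t imm64)
	{
		insn[0] = (struct bpf_insn){ .code = BPF_LD_IMM64_CODE,
					     .dst_reg = dst, .src_reg = src,
					     .imm = (int32_t)(uint32_t)imm64 };
		insn[1] = (struct bpf_insn){ .imm = (int32_t)(uint32_t)(imm64 >> 32) };
	}

	/* verifier-style pass: convert pseudo map-fd loads into map-pointer loads */
	static int fixup_pseudo_map_fd(struct bpf_insn *insn, int insn_cnt,
				       void * const fd_to_map[], int nr_fds)
	{
		for (int i = 0; i < insn_cnt; i++) {
			if (insn[i].code != BPF_LD_IMM64_CODE ||
			    insn[i].src_reg != BPF_PSEUDO_MAP_FD)
				continue;
			int fd = insn[i].imm;
			if (fd < 0 || fd >= nr_fds || !fd_to_map[fd])
				return -1;			/* bad map fd */
			uint64_t ptr = (uint64_t)(uintptr_t)fd_to_map[fd];
			insn[i].src_reg = 0;			/* drop the pseudo flag */
			insn[i].imm     = (int32_t)(uint32_t)ptr;
			insn[i + 1].imm = (int32_t)(uint32_t)(ptr >> 32);
			i++;					/* skip second half */
		}
		return 0;
	}

	int main(void)
	{
		long dummy_map;			/* stand-in for a struct bpf_map */
		void *fd_to_map[] = { NULL, NULL, NULL, &dummy_map };
		struct bpf_insn prog[2];

		emit_ld_imm64(prog, 1 /* R1 */, BPF_PSEUDO_MAP_FD, 3 /* map fd */);
		if (fixup_pseudo_map_fd(prog, 2, fd_to_map, 4))
			return 1;
		printf("imm lo/hi after fixup: %#x %#x\n",
		       (unsigned)prog[0].imm, (unsigned)prog[1].imm);
		return 0;
	}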
The tracing use case got some improvements as well. Now eBPF programs can be attached to tracepoint, syscall and kprobe events, and the C examples are more usable:

  ex1_kern.c - demonstrates how programs can walk in-kernel data structures
  ex2_kern.c - in-kernel event accounting and user space histograms

See patch #25.

TODO:
- the verifier is safe, but not secure, since it allows kernel address leaking. Fix that before lifting the root-only restriction
- allow seccomp to use eBPF
- write a manpage for the eBPF syscall

As always all patches are available at:
  git://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf master

V3->V4:
- introduced 'load 64-bit immediate' eBPF instruction
- use BPF_LD_IMM64 in LLVM, verifier, programs
- got rid of 'fixup' section in eBPF programs
- got rid of map IDR and internal map_id
- split verifier into 6 patches and added verifier testsuite
- added verifier check for reserved instruction fields
- fixed bug in LLVM eBPF backend (it was miscompiling __builtin_expect)
- fixed race condition in htab_map_update_elem()
- tracing filters can now attach to tracepoint, syscall, kprobe events
- improved C examples

V2->V3:
- fixed verifier register range bug and addressed other comments (Thanks Kees!)
- re-added LLVM eBPF backend
- added two examples in C
- user space ELF parser and loader example

V1->V2:
- got rid of global id, everything is now FD based (Thanks Andy!)
- split type enum in verifier (as suggested by Andy and Namhyung)
- switched gpl enforcement to be kmod-like (as suggested by Andy and David)
- addressed feedback from Namhyung, Chema, Joe
- added more comments to verifier
- renamed sock_filter_int -> bpf_insn
- rebased on net-next

The FD approach made the eBPF user interface much cleaner for the sockets/seccomp/tracing use cases. Now the socket and tracing examples (patches 15 and 16) can be Ctrl-C'ed in the middle and the kernel will auto-cleanup everything, including tracing filters.

---- Old V1 cover letter:

'maps' is a generic storage of different types for sharing data between kernel and userspace. Maps are referenced by file descriptor. A root process can create multiple maps of different types where key/value are opaque bytes of data. It's up to user space and the eBPF program to decide what they store in the maps.

eBPF programs are similar to kernel modules. They are loaded by the user space program and unloaded on closing of the fd. Each program is a safe run-to-completion set of instructions. The eBPF verifier statically determines that the program terminates and is safe to execute. During verification the program takes a hold of the maps that it intends to use, so selected maps cannot be removed until the program is unloaded.

The program can be attached to different events. These events can be packets, tracepoint events and other types in the future. A new event triggers execution of the program, which may store information about the event in the maps. Beyond storing data the programs may call into in-kernel helper functions which may, for example, dump stack, do trace_printk or other forms of live kernel debugging. The same program can be attached to multiple events. Different programs can access the same map:

  tracepoint  tracepoint  tracepoint    sk_buff    sk_buff
   event A     event B     event C      on eth0    on eth1
     |            |           |            |          |
     |            |           |            |          |
     --> tracing <--        tracing      socket     socket
          prog_1             prog_2      prog_3     prog_4
          |  |                  |           |
       |---  ----|              |-----------|
     map_3      map_1               map_2

User space (via syscall) and eBPF programs access maps concurrently.
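As a rough illustration of that concurrent access, a user-space side could keep dumping a shared map in a loop while the attached programs update it. The wrapper names bpf_create_map()/bpf_get_next_key()/bpf_lookup_elem() below are assumptions modeled on the mini library added in the samples patches; the real names, signatures and map-type constants may differ.

	/* sketch only: user space periodically dumping a shared hash map
	 * while eBPF programs attached to events keep updating it.
	 * Wrapper declarations below are assumed, not a verified API. */
	#include <stdio.h>
	#include <unistd.h>

	extern int bpf_create_map(int map_type, int key_size, int value_size,
				  int max_entries);			/* assumed wrapper */
	extern int bpf_get_next_key(int map_fd, void *key, void *next_key);	/* assumed */
	extern int bpf_lookup_elem(int map_fd, void *key, void *value);	/* assumed */

	int main(void)
	{
		/* hash map: key = event id, value = counter; type id assumed */
		int map_fd = bpf_create_map(1 /* BPF_MAP_TYPE_HASH, assumed */,
					    sizeof(long), sizeof(long), 1024);
		if (map_fd < 0)
			return 1;

		/* ... load a program that does BPF_LD_MAP_FD(BPF_REG_1, map_fd),
		 * attach it to a socket/tracepoint, then poll the map: */
		for (;;) {
			long key = -1, next_key, value;

			while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
				if (bpf_lookup_elem(map_fd, &next_key, &value) == 0)
					printf("key %ld -> count %ld\n", next_key, value);
				key = next_key;
			}
			sleep(1);
		}
		return 0;
	}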
------

Alexei Starovoitov (26):
  net: filter: add "load 64-bit immediate" eBPF instruction
  net: filter: split filter.h and expose eBPF to user space
  bpf: introduce syscall(BPF, ...) and BPF maps
  bpf: enable bpf syscall on x64
  bpf: add lookup/update/delete/iterate methods to BPF maps
  bpf: add hashtable type of BPF maps
  bpf: expand BPF syscall with program load/unload
  bpf: handle pseudo BPF_CALL insn
  bpf: verifier (add docs)
  bpf: verifier (add ability to receive verification log)
  bpf: handle pseudo BPF_LD_IMM64 insn
  bpf: verifier (add branch/goto checks)
  bpf: verifier (add verifier core)
  bpf: verifier (add state prunning optimization)
  bpf: allow eBPF programs to use maps
  net: sock: allow eBPF programs to be attached to sockets
  tracing: allow eBPF programs to be attached to events
  tracing: allow eBPF programs to be attached to kprobe/kretprobe
  samples: bpf: add mini eBPF library to manipulate maps and programs
  samples: bpf: example of stateful socket filtering
  samples: bpf: example of tracing filters with eBPF
  bpf: llvm backend
  samples: bpf: elf file loader
  samples: bpf: eBPF example in C
  samples: bpf: counting eBPF example in C
  bpf: verifier test

--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html