Hi Alexei,

The page needs a license. See
https://www.kernel.org/doc/man-pages/licenses.html
for some possible choices.

Thanks,

Michael

On 03/09/2015 11:10 PM, Alexei Starovoitov wrote:
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
> ---
>  man2/bpf.2 | 593 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 593 insertions(+)
>  create mode 100644 man2/bpf.2
>
> diff --git a/man2/bpf.2 b/man2/bpf.2
> new file mode 100644
> index 0000000..21b42b4
> --- /dev/null
> +++ b/man2/bpf.2
> @@ -0,0 +1,593 @@
> +.TH BPF 2 2015-03-09 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +bpf - perform a command on extended BPF map or program
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/bpf.h>
> +.sp
> +.BI "int bpf(int cmd, union bpf_attr *attr, unsigned int size);"
> +
> +.SH DESCRIPTION
> +.BR bpf()
> +syscall is a multiplexor for a range of different operations on extended BPF
> +which can be characterized as "universal in-kernel virtual machine".
> +Extended BPF (or eBPF) is similar to original Berkeley Packet Filter
> +(or "classic BPF") used to filter network packets. Both statically analyze
> +the programs before loading them into the kernel to ensure that programs cannot
> +harm the running system.
> +.P
> +eBPF extends classic BPF in multiple ways including ability to call
> +in-kernel helper functions and access shared data structures like BPF maps.
> +The programs can be written in a restricted C that is compiled into
> +eBPF bytecode and executed on the in-kernel virtual machine or JITed into native
> +instruction set.
> +.SS Extended BPF Design/Architecture
> +.P
> +BPF maps are a generic storage of different types.
> +User process can create multiple maps (with key/value being
> +opaque bytes of data) and access them via file descriptor. In parallel BPF
> +programs can access maps from inside the kernel.
> +It's up to user process and BPF program to decide what they store inside maps.
> +.P
> +BPF programs are similar to kernel modules. They are loaded by the user
> +process and automatically unloaded when process exits. Each BPF program is
> +a safe run-to-completion set of instructions. BPF verifier statically
> +determines that the program terminates and is safe to execute. During
> +verification the program takes a hold of maps that it intends to use,
> +so selected maps cannot be removed until the program is unloaded. The program
> +can be attached to different events. These events can be packets, tracing
> +events and other types in the future. A new event triggers execution of
> +the program which may store information about the event in the maps.
> +Beyond storing data the programs may call into in-kernel helper functions.
> +The same program can be attached to multiple events. Different programs can
> +access the same map:
> +.nf
> +    tracing     tracing     tracing     packet     packet
> +    event A     event B     event C     on eth0    on eth1
> +     |             |          |           |          |
> +     |             |          |           |          |
> +     --> tracing <--      tracing       socket     socket
> +          prog_1           prog_2       prog_3     prog_4
> +          |  |               |            |
> +       |---    -----|  |-------|        map_3
> +     map_1        map_2
> +.fi
> +.SS Syscall Arguments
> +.B bpf()
> +syscall operation is determined by
> +.IR cmd
> +which can be one of the following:
> +.TP
> +.B BPF_MAP_CREATE
> +Create a map with given type and attributes and return map FD
> +.TP
> +.B BPF_MAP_LOOKUP_ELEM
> +Lookup element by key in a given map and return its value
> +.TP
> +.B BPF_MAP_UPDATE_ELEM
> +Create or update element (key/value pair) in a given map
> +.TP
> +.B BPF_MAP_DELETE_ELEM
> +Lookup and delete element by key in a given map
> +.TP
> +.B BPF_MAP_GET_NEXT_KEY
> +Lookup element by key in a given map and return key of next element
> +.TP
> +.B BPF_PROG_LOAD
> +Verify and load BPF program
> +.TP
> +.B attr
> +is a pointer to a union of type bpf_attr as defined below.
> +.TP
> +.B size
> +is the size of the union.
> +.P
> +.nf
> +union bpf_attr {
> +    struct { /* anonymous struct used by BPF_MAP_CREATE command */
> +        __u32 map_type;
> +        __u32 key_size; /* size of key in bytes */
> +        __u32 value_size; /* size of value in bytes */
> +        __u32 max_entries; /* max number of entries in a map */
> +    };
> +
> +    struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> +        __u32 map_fd;
> +        __aligned_u64 key;
> +        union {
> +            __aligned_u64 value;
> +            __aligned_u64 next_key;
> +        };
> +        __u64 flags;
> +    };
> +
> +    struct { /* anonymous struct used by BPF_PROG_LOAD command */
> +        __u32 prog_type;
> +        __u32 insn_cnt;
> +        __aligned_u64 insns; /* 'const struct bpf_insn *' */
> +        __aligned_u64 license; /* 'const char *' */
> +        __u32 log_level; /* verbosity level of verifier */
> +        __u32 log_size; /* size of user buffer */
> +        __aligned_u64 log_buf; /* user supplied 'char *' buffer */
> +    };
> +} __attribute__((aligned(8)));
> +.fi
> +.SS BPF maps
> +Maps are a generic storage of different types for sharing data between kernel
> +and userspace.
> +
> +Any map type has the following attributes:
> +  . type
> +  . max number of elements
> +  . key size in bytes
> +  . value size in bytes
> +
> +The following wrapper functions demonstrate how this syscall can be used to
> +access the maps. The functions use the
> +.IR cmd
> +argument to invoke different operations.
> +.TP
> +.B BPF_MAP_CREATE
> +.nf
> +int bpf_create_map(enum bpf_map_type map_type, int key_size,
> +                   int value_size, int max_entries)
> +{
> +    union bpf_attr attr = {
> +        .map_type = map_type,
> +        .key_size = key_size,
> +        .value_size = value_size,
> +        .max_entries = max_entries
> +    };
> +
> +    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
> +}
> +.fi
> +bpf() syscall creates a map of
> +.I map_type
> +type and given attributes
> +.I key_size, value_size, max_entries.
> +On success it returns process-local file descriptor. On error, \-1 is returned and
> +.I errno
> +is set to EINVAL or EPERM or ENOMEM.
> +
> +The attributes
> +.I key_size
> +and
> +.I value_size
> +will be used by verifier during program loading to check that program is calling
> +bpf_map_*_elem() helper functions with correctly initialized
> +.I key
> +and that program doesn't access map element
> +.I value
> +beyond specified
> +.I value_size.
> +For example, when map is created with key_size = 8 and program does:
> +.nf
> +bpf_map_lookup_elem(map_fd, fp - 4)
> +.fi
> +such program will be rejected,
> +since in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects
> +to read 8 bytes from 'key' pointer, but 'fp - 4' starting address will cause
> +out of bounds stack access.
> +
> +Similarly, when map is created with value_size = 1 and program does:
> +.nf
> +value = bpf_map_lookup_elem(...);
> +*(u32 *)value = 1;
> +.fi
> +such program will be rejected, since it accesses
> +.I value
> +pointer beyond specified 1 byte value_size limit.
> +
> +Currently two
> +.I map_type
> +are supported:
> +.nf
> +enum bpf_map_type {
> +    BPF_MAP_TYPE_UNSPEC,
> +    BPF_MAP_TYPE_HASH,
> +    BPF_MAP_TYPE_ARRAY,
> +};
> +.fi
> +.I map_type
> +selects one of the available map implementations in kernel. For all map_types
> +programs access maps with the same bpf_map_lookup_elem()/bpf_map_update_elem()
> +helper functions.
> +.TP
> +.B BPF_MAP_LOOKUP_ELEM
> +.nf
> +int bpf_lookup_elem(int fd, void *key, void *value)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +        .value = ptr_to_u64(value),
> +    };
> +
> +    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +bpf() syscall looks up an element with given
> +.I key
> +in a map
> +.I fd.
> +If element is found it returns zero and stores element's value into
> +.I value.
> +If element is not found it returns \-1 and sets
> +.I errno
> +to ENOENT.
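
The map wrappers quoted above call ptr_to_u64(), which the page never
defines. A minimal sketch of what such a helper might look like, assuming
its only job is to widen a user-space pointer for the __aligned_u64 fields
of union bpf_attr. The body is an assumption, not the patch author's
definition, and uint64_t from <stdint.h> stands in for __u64:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not defined anywhere in the patch): convert a
 * user-space pointer to the 64-bit integer form stored in the
 * __aligned_u64 fields of union bpf_attr.
 *
 * Casting through uintptr_t first avoids a pointer-to-wider-integer
 * warning on 32-bit builds, where the upper 32 bits simply end up zero.
 */
static inline uint64_t ptr_to_u64(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr;
}
```

If the page keeps the wrappers as they are, a definition along these lines
(or a pointer to where one lives) would probably need to be shown next to
them.
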
> +.TP
> +.B BPF_MAP_UPDATE_ELEM
> +.nf
> +int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +        .value = ptr_to_u64(value),
> +        .flags = flags,
> +    };
> +
> +    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The call creates or updates element with given
> +.I key/value
> +in a map
> +.I fd
> +according to
> +.I flags
> +which can have 3 possible values:
> +.nf
> +#define BPF_ANY 0 /* create new element or update existing */
> +#define BPF_NOEXIST 1 /* create new element if it didn't exist */
> +#define BPF_EXIST 2 /* update existing element */
> +.fi
> +On success it returns zero.
> +On error, \-1 is returned and
> +.I errno
> +is set to EINVAL or EPERM or ENOMEM or E2BIG.
> +.B E2BIG
> +indicates that number of elements in the map reached
> +.I max_entries
> +limit specified at map creation time.
> +.B EEXIST
> +will be returned from call bpf_update_elem(fd, key, value, BPF_NOEXIST) if element
> +with 'key' already exists in the map.
> +.B ENOENT
> +will be returned from call bpf_update_elem(fd, key, value, BPF_EXIST) if element
> +with 'key' doesn't exist in the map.
> +.TP
> +.B BPF_MAP_DELETE_ELEM
> +.nf
> +int bpf_delete_elem(int fd, void *key)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +    };
> +
> +    return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The call deletes an element in a map
> +.I fd
> +with given
> +.I key.
> +Returns zero on success. If element is not found it returns \-1 and sets
> +.I errno
> +to ENOENT.
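
The BPF_ANY / BPF_NOEXIST / BPF_EXIST rules described for
BPF_MAP_UPDATE_ELEM above may be easier to follow as a small model. The
sketch below is a user-space toy for a hash-like map, not kernel code:
toy_map, toy_find(), toy_update_elem() and TOY_MAX_ENTRIES are hypothetical
names, keys and values are plain ints, and rejecting unknown flag values
with EINVAL is an assumption about which case that errno covers.

```c
#include <errno.h>

/* Flag values as given in the quoted text. */
#define BPF_ANY     0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element if it didn't exist */
#define BPF_EXIST   2 /* update existing element */

#define TOY_MAX_ENTRIES 2 /* hypothetical max_entries for this model */

/* Toy user-space stand-in for a map with int keys and values. */
struct toy_map {
    int used[TOY_MAX_ENTRIES];
    int key[TOY_MAX_ENTRIES];
    int value[TOY_MAX_ENTRIES];
};

/* Return the slot that holds key, or -1 if the key is absent. */
static int toy_find(const struct toy_map *m, int key)
{
    for (int i = 0; i < TOY_MAX_ENTRIES; i++)
        if (m->used[i] && m->key[i] == key)
            return i;
    return -1;
}

/* Follows the documented convention: 0 on success, -1 with errno set. */
static int toy_update_elem(struct toy_map *m, int key, int value,
                           unsigned long long flags)
{
    int slot = toy_find(m, key);

    if (flags > BPF_EXIST) {          /* assumption: unknown flag -> EINVAL */
        errno = EINVAL;
        return -1;
    }
    if (slot >= 0 && flags == BPF_NOEXIST) {
        errno = EEXIST;               /* key already present */
        return -1;
    }
    if (slot < 0 && flags == BPF_EXIST) {
        errno = ENOENT;               /* nothing to update */
        return -1;
    }
    if (slot < 0) {
        /* Creating a new element: find a free slot or report a full map. */
        for (slot = 0; slot < TOY_MAX_ENTRIES && m->used[slot]; slot++)
            ;
        if (slot == TOY_MAX_ENTRIES) {
            errno = E2BIG;            /* max_entries reached */
            return -1;
        }
        m->used[slot] = 1;
        m->key[slot] = key;
    }
    m->value[slot] = value;
    return 0;
}
```

In this model, updating an existing key never fails with E2BIG even when
the map is full; only creation of a new element can hit the max_entries
limit. Whether the page should state that explicitly is a question for the
text above.
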
> +.TP
> +.B BPF_MAP_GET_NEXT_KEY
> +.nf
> +int bpf_get_next_key(int fd, void *key, void *next_key)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +        .next_key = ptr_to_u64(next_key),
> +    };
> +
> +    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
> +}
> +.fi
> +The call looks up an element by
> +.I key
> +in a given map
> +.I fd
> +and returns key of the next element into
> +.I next_key
> +pointer. If
> +.I key
> +is not found, it returns zero and returns key of the first element into
> +.I next_key. If
> +.I key
> +is the last element, it returns \-1 and sets
> +.I errno
> +to ENOENT. Other possible
> +.I errno
> +values are ENOMEM, EFAULT, EPERM, EINVAL.
> +This method can be used to iterate over all elements of the map.
> +.TP
> +.B close(map_fd)
> +will delete the map
> +.I map_fd.
> +Exiting process will delete all maps automatically.
> +.P
> +.SS BPF programs
> +
> +.TP
> +.B BPF_PROG_LOAD
> +This
> +.IR cmd
> +is used to load extended BPF program into the kernel.
> +
> +.nf
> +char bpf_log_buf[LOG_BUF_SIZE];
> +
> +int bpf_prog_load(enum bpf_prog_type prog_type,
> +                  const struct bpf_insn *insns, int insn_cnt,
> +                  const char *license)
> +{
> +    union bpf_attr attr = {
> +        .prog_type = prog_type,
> +        .insns = ptr_to_u64(insns),
> +        .insn_cnt = insn_cnt,
> +        .license = ptr_to_u64(license),
> +        .log_buf = ptr_to_u64(bpf_log_buf),
> +        .log_size = LOG_BUF_SIZE,
> +        .log_level = 1,
> +    };
> +
> +    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
> +}
> +.fi
> +.B prog_type
> +is one of the available program types:
> +.nf
> +enum bpf_prog_type {
> +    BPF_PROG_TYPE_UNSPEC,
> +    BPF_PROG_TYPE_SOCKET_FILTER,
> +    BPF_PROG_TYPE_SCHED_CLS,
> +};
> +.fi
> +By picking
> +.I prog_type
> +program author selects a set of helper functions callable from
> +the program and corresponding format of
> +.I struct bpf_context
> +(which is the data blob passed into the program as the first argument).
> +For example, the programs loaded with
> +.I prog_type
> += BPF_PROG_TYPE_SOCKET_FILTER may call bpf_map_lookup_elem() helper,
> +whereas some future types may not.
> +The set of functions available to the programs under given type may increase
> +in the future.
> +
> +Currently the set of functions for
> +.B BPF_PROG_TYPE_SOCKET_FILTER
> +is:
> +.nf
> +bpf_map_lookup_elem(map_fd, void *key) // lookup key in a map_fd
> +bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
> +bpf_map_delete_elem(map_fd, void *key) // delete key in a map_fd
> +.fi
> +
> +and bpf_context is a pointer to 'struct sk_buff'. Programs cannot
> +access fields of 'sk_buff' directly.
> +
> +More program types may be added in the future, such as
> +.BR BPF_PROG_TYPE_KPROBE ,
> +whose bpf_context may be defined as a pointer to 'struct pt_regs'.
> +
> +.B insns
> +array of "struct bpf_insn" instructions
> +
> +.B insn_cnt
> +number of instructions in the program
> +
> +.B license
> +license string, which must be GPL compatible to call helper functions
> +marked gpl_only
> +
> +.B log_buf
> +user supplied buffer that in-kernel verifier is using to store verification
> +log. Log is a multi-line string that should be used by program author to
> +understand how verifier came to conclusion that program is unsafe. The format
> +of the output can change at any time as verifier evolves.
> +
> +.B log_size
> +size of user buffer. If size of the buffer is not large enough to store all
> +verifier messages, \-1 is returned and
> +.I errno
> +is set to ENOSPC.
> +
> +.B log_level
> +verbosity level of verifier, where zero means no logs provided
> +.TP
> +.B close(prog_fd)
> +will unload BPF program
> +.P
> +The maps are accessible from programs and used to exchange data between
> +programs and between program and user space.
> +Programs process various events (like kprobe, packets) and
> +store the data into maps. User space fetches data from maps.
> +Either the same or a different map may be used by user space as configuration
> +space to alter program behavior on the fly.
> +.SS Events
> +.P
> +Once the program is loaded, it can be attached to an event. Various kernel
> +subsystems have different ways to do so. For example:
> +
> +.nf
> +setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
> +.fi
> +will attach the program
> +.I prog_fd
> +to socket
> +.I sock
> +which was received by prior call to socket().
> +
> +In the future
> +.nf
> +ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +.fi
> +may attach the program
> +.I prog_fd
> +to perf event
> +.I event_fd
> +which was received by prior call to perf_event_open().
> +
> +.SH EXAMPLES
> +.nf
> +/* bpf+sockets example:
> + * 1. create array map of 256 elements
> + * 2. load program that counts number of packets received
> + *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
> + *    map[r0]++
> + * 3. attach prog_fd to raw socket via setsockopt()
> + * 4. print number of received TCP/UDP packets every second
> + */
> +int main(int ac, char **av)
> +{
> +    int sock, map_fd, prog_fd, key;
> +    long long value = 0, tcp_cnt, udp_cnt;
> +
> +    map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 256);
> +    if (map_fd < 0) {
> +        printf("failed to create map '%s'\\n", strerror(errno));
> +        /* likely not run as root */
> +        return 1;
> +    }
> +
> +    struct bpf_insn prog[] = {
> +        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */
> +        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), /* r0 = ip->proto */
> +        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
> +        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */
> +        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */
> +        BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */
> +        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), /* r0 = map_lookup(r1, r2) */
> +        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), /* if (r0 == 0) goto pc+2 */
> +        BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
> +        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* lock *(u64 *)r0 += r1 */
> +        BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +        BPF_EXIT_INSN(), /* return r0 */
> +    };
> +
> +    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog) / sizeof(prog[0]), "GPL");
> +
> +    sock = open_raw_sock("lo");
> +
> +    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) == 0);
> +
> +    for (;;) {
> +        key = IPPROTO_TCP;
> +        assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
> +        key = IPPROTO_UDP;
> +        assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
> +        printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt);
> +        sleep(1);
> +    }
> +
> +    return 0;
> +}
> +.fi
> +.SH RETURN VALUE
> +For a successful call, the return value depends on the operation:
> +.TP
> +.B BPF_MAP_CREATE
> +The new file descriptor associated with BPF map.
> +.TP
> +.B BPF_PROG_LOAD
> +The new file descriptor associated with BPF program.
> +.TP
> +All other commands
> +Zero.
> +.PP
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EPERM
> +bpf() syscall was made without sufficient privilege
> +(without the
> +.B CAP_SYS_ADMIN
> +capability).
> +.TP
> +.B ENOMEM
> +Cannot allocate sufficient memory.
> +.TP
> +.B EBADF
> +.I fd
> +is not an open file descriptor
> +.TP
> +.B EFAULT
> +One of the pointers (
> +.I key
> +or
> +.I value
> +or
> +.I log_buf
> +or
> +.I insns
> +) is outside accessible address space.
> +.TP
> +.B EINVAL
> +The value specified in
> +.I cmd
> +is not recognized by this kernel.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_MAP_CREATE ,
> +either
> +.I map_type
> +or attributes are invalid.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_MAP_*_ELEM
> +commands,
> +some of the fields of "union bpf_attr" unused by this command are not set
> +to zero.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_PROG_LOAD,
> +attempt to load invalid program (unrecognized instruction or uses reserved
> +fields or jumps out of range or loop detected or calls unknown function).
> +.TP
> +.BR EACCES
> +For
> +.BR BPF_PROG_LOAD,
> +though program has valid instructions, it was rejected, since it was deemed
> +unsafe (may access disallowed memory region or uninitialized stack/register
> +or function constraints don't match actual types or misaligned access). In
> +such case it is recommended to call bpf() again with
> +.I log_level = 1
> +and examine
> +.I log_buf
> +for specific reason provided by verifier.
> +.TP
> +.BR ENOENT
> +For
> +.B BPF_MAP_LOOKUP_ELEM
> +or
> +.B BPF_MAP_DELETE_ELEM,
> +indicates that element with given
> +.I key
> +was not found.
> +.TP
> +.BR E2BIG
> +program is too large or
> +a map reached
> +.I max_entries
> +limit (max number of elements).
> +.SH NOTES
> +These commands may be used only by a privileged process (one having the
> +.B CAP_SYS_ADMIN
> +capability).
> +.SH SEE ALSO
> +Both classic and extended BPF are explained in Documentation/networking/filter.txt
>


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/