[PATCH man-pages] bpf.2: new page documenting bpf(2)

Alexei Starovoitov <ast@xxxxxxxxxxxx> · Mon, 9 Mar 2015 15:10:36 -0700

Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
---
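A note on the wrappers used in the page: the examples call bpf() and
ptr_to_u64() without defining either. Below is a minimal sketch of what
they might look like. It assumes that the C library offers no bpf()
wrapper and that the installed kernel headers define __NR_bpf. Both are
assumptions about the build environment, not something the page states,
and the two helper bodies are illustrative only.

```c
#define _GNU_SOURCE        /* for the syscall() declaration */
#include <linux/bpf.h>     /* union bpf_attr, __u64, BPF_* commands */
#include <sys/syscall.h>   /* __NR_bpf (assumed to be provided by the headers) */
#include <unistd.h>        /* syscall() */

/* Assumed helper: store a user-space pointer in a 64-bit attr field. */
static inline __u64 ptr_to_u64(const void *ptr)
{
    return (__u64) (unsigned long) ptr;
}

/* Assumed wrapper: issue the raw system call.
 * Returns -1 and sets errno on error, like any other system call. */
static inline int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}
```

With these two definitions in scope, the wrapper functions shown in the
page, such as bpf_create_map() and bpf_lookup_elem(), should compile as
written.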
 man2/bpf.2 |  593 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 593 insertions(+)
 create mode 100644 man2/bpf.2

diff --git a/man2/bpf.2 b/man2/bpf.2
new file mode 100644
index 0000000..21b42b4
--- /dev/null
+++ b/man2/bpf.2
@@ -0,0 +1,593 @@
+.TH BPF 2 2015-03-09 "Linux" "Linux Programmer's Manual"
+.SH NAME
+bpf \- perform a command on an extended BPF map or program
+.SH SYNOPSIS
+.nf
+.B #include <linux/bpf.h>
+.sp
+.BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size );
+
+.SH DESCRIPTION
+.BR bpf ()
+is a multiplexor for a range of different operations on extended BPF,
+which can be characterized as a "universal in-kernel virtual machine".
+Extended BPF (or eBPF) is similar to the original Berkeley Packet Filter
+(or "classic BPF") used to filter network packets. Both statically analyze
+programs before loading them into the kernel to ensure that they cannot
+harm the running system.
+.P
+eBPF extends classic BPF in multiple ways, including the ability to call
+in-kernel helper functions and to access shared data structures such as BPF maps.
+Programs can be written in a restricted C that is compiled into
+eBPF bytecode and either executed on the in-kernel virtual machine or JIT-compiled
+into the native instruction set.
+.SS Extended BPF Design/Architecture
+.P
+BPF maps are generic storage of different types.
+A user process can create multiple maps (with keys and values being
+opaque bytes of data) and access them via file descriptors. In parallel, BPF
+programs can access maps from inside the kernel.
+It is up to the user process and the BPF program to decide what they store in maps.
+.P
+BPF programs are similar to kernel modules. They are loaded by a user
+process and automatically unloaded when the process exits. Each BPF program is
+a safe run-to-completion set of instructions. The BPF verifier statically
+determines that the program terminates and is safe to execute. During
+verification the program takes hold of the maps that it intends to use,
+so the selected maps cannot be removed until the program is unloaded. The program
+can be attached to different events. These events can be packets, tracing
+events, and other types in the future. A new event triggers execution of
+the program, which may store information about the event in the maps.
+Beyond storing data, programs may call in-kernel helper functions.
+The same program can be attached to multiple events. Different programs can
+access the same map:
+.nf
+  tracing     tracing     tracing     packet     packet
+  event A     event B     event C     on eth0    on eth1
+   |             |          |           |          |
+   |             |          |           |          |
+   --> tracing <--      tracing       socket     socket
+        prog_1           prog_2       prog_3     prog_4
+        |  |               |            |
+     |---  -----|  |-------|           map_3
+   map_1       map_2
+.fi
+.SS Syscall Arguments
+The operation is determined by the
+.I cmd
+argument,
+which can be one of the following:
+.TP
+.B BPF_MAP_CREATE
+Create a map with the given type and attributes and return a map file descriptor.
+.TP
+.B BPF_MAP_LOOKUP_ELEM
+Look up an element by key in a given map and return its value.
+.TP
+.B BPF_MAP_UPDATE_ELEM
+Create or update an element (key/value pair) in a given map.
+.TP
+.B BPF_MAP_DELETE_ELEM
+Look up and delete an element by key in a given map.
+.TP
+.B BPF_MAP_GET_NEXT_KEY
+Look up an element by key in a given map and return the key of the next element.
+.TP
+.B BPF_PROG_LOAD
+Verify and load a BPF program.
+.TP
+.I attr
+is a pointer to a union of type bpf_attr, as defined below.
+.TP
+.I size
+is the size of the union in bytes.
+.P
+.nf
+union bpf_attr {
+    struct { /* anonymous struct used by BPF_MAP_CREATE command */
+        __u32             map_type;
+        __u32             key_size;    /* size of key in bytes */
+        __u32             value_size;  /* size of value in bytes */
+        __u32             max_entries; /* max number of entries in a map */
+    };
+
+    struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
+        __u32             map_fd;
+        __aligned_u64     key;
+        union {
+            __aligned_u64 value;
+            __aligned_u64 next_key;
+        };
+        __u64             flags;
+    };
+
+    struct { /* anonymous struct used by BPF_PROG_LOAD command */
+        __u32         prog_type;
+        __u32         insn_cnt;
+        __aligned_u64 insns;     /* 'const struct bpf_insn *' */
+        __aligned_u64 license;   /* 'const char *' */
+        __u32         log_level; /* verbosity level of verifier */
+        __u32         log_size;  /* size of user buffer */
+        __aligned_u64 log_buf;   /* user supplied 'char *' buffer */
+    };
+} __attribute__((aligned(8)));
+.fi
+.SS BPF maps
+Maps are generic storage of different types for sharing data between the kernel
+and user space.
+
+Every map type has the following attributes:
+  . type
+  . maximum number of elements
+  . key size in bytes
+  . value size in bytes
+
+The following wrapper functions demonstrate how this syscall can be used to
+access the maps. The functions use the
+.IR cmd
+argument to invoke different operations.
+.TP
+.B BPF_MAP_CREATE
+.nf
+int bpf_create_map(enum bpf_map_type map_type, int key_size,
+                   int value_size, int max_entries)
+{
+    union bpf_attr attr = {
+        .map_type = map_type,
+        .key_size = key_size,
+        .value_size = value_size,
+        .max_entries = max_entries
+    };
+
+    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+}
+.fi
+The call creates a map of type
+.I map_type
+with the given attributes
+.IR key_size ", " value_size ", " max_entries .
+On success it returns a process-local file descriptor. On error, \-1 is returned and
+.I errno
+is set to EINVAL, EPERM, or ENOMEM.
+
+The attributes
+.I key_size
+and
+.I value_size
+are used by the verifier during program loading to check that the program calls
+bpf_map_*_elem() helper functions with a correctly initialized
+.I key
+and that the program doesn't access the map element
+.I value
+beyond the specified
+.IR value_size .
+For example, when a map is created with key_size = 8 and the program does:
+.nf
+bpf_map_lookup_elem(map_fd, fp - 4)
+.fi
+the program will be rejected,
+since the in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects
+to read 8 bytes from the 'key' pointer, but the 'fp - 4' starting address would cause
+an out-of-bounds stack access.
+
+Similarly, when a map is created with value_size = 1 and the program does:
+.nf
+value = bpf_map_lookup_elem(...);
+*(u32 *)value = 1;
+.fi
+the program will be rejected, since it accesses the
+.I value
+pointer beyond the specified 1-byte value_size limit.
+
+Currently two
+.I map_type
+values are supported:
+.nf
+enum bpf_map_type {
+   BPF_MAP_TYPE_UNSPEC,
+   BPF_MAP_TYPE_HASH,
+   BPF_MAP_TYPE_ARRAY,
+};
+.fi
+.I map_type
+selects one of the available map implementations in the kernel. For all map types,
+programs access maps with the same bpf_map_lookup_elem()/bpf_map_update_elem()
+helper functions.
+.TP
+.B BPF_MAP_LOOKUP_ELEM
+.nf
+int bpf_lookup_elem(int fd, void *key, void *value)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+        .value = ptr_to_u64(value),
+    };
+
+    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+.fi
+The call looks up an element with the given
+.I key
+in the map referred to by
+.IR fd .
+If the element is found, it returns zero and stores the element's value into
+.IR value .
+If the element is not found, it returns \-1 and sets
+.I errno
+to ENOENT.
+.TP
+.B BPF_MAP_UPDATE_ELEM
+.nf
+int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+        .value = ptr_to_u64(value),
+        .flags = flags,
+    };
+
+    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
+}
+.fi
+The call creates or updates an element with the given
+.I key/value
+in the map referred to by
+.I fd
+according to
+.IR flags ,
+which can have 3 possible values:
+.nf
+#define BPF_ANY         0 /* create new element or update existing */
+#define BPF_NOEXIST     1 /* create new element if it didn't exist */
+#define BPF_EXIST       2 /* update existing element */
+.fi
+On success it returns zero.
+On error, \-1 is returned and
+.I errno
+is set to EINVAL, EPERM, ENOMEM, E2BIG, EEXIST, or ENOENT.
+.B E2BIG
+indicates that the number of elements in the map reached the
+.I max_entries
+limit specified at map creation time.
+.B EEXIST
+is returned from bpf_update_elem(fd, key, value, BPF_NOEXIST) if an element
+with 'key' already exists in the map.
+.B ENOENT
+is returned from bpf_update_elem(fd, key, value, BPF_EXIST) if no element
+with 'key' exists in the map.
+.TP
+.B BPF_MAP_DELETE_ELEM
+.nf
+int bpf_delete_elem(int fd, void *key)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+    };
+
+    return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+}
+.fi
+The call deletes the element with the given
+.I key
+from the map
+.IR fd .
+It returns zero on success. If the element is not found, it returns \-1 and sets
+.I errno
+to ENOENT.
+.TP
+.B BPF_MAP_GET_NEXT_KEY
+.nf
+int bpf_get_next_key(int fd, void *key, void *next_key)
+{
+    union bpf_attr attr = {
+        .map_fd = fd,
+        .key = ptr_to_u64(key),
+        .next_key = ptr_to_u64(next_key),
+    };
+
+    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
+}
+.fi
+The call looks up an element by
+.I key
+in the given map
+.I fd
+and stores the key of the next element into
+.IR next_key .
+If
+.I key
+is not found, it returns zero and stores the key of the first element into
+.IR next_key .
+If
+.I key
+is the last element, it returns \-1 and sets
+.I errno
+to ENOENT.
+Other possible error values are ENOMEM, EFAULT, EPERM, and EINVAL.
+This method can be used to iterate over all elements of the map.
+.TP
+.B close(map_fd)
+deletes the map
+.IR map_fd .
+When the process exits, all of its maps are deleted automatically.
+.P
+.SS BPF programs
+
+.TP
+.B BPF_PROG_LOAD
+This
+.I cmd
+is used to load an extended BPF program into the kernel.
+
+.nf
+char bpf_log_buf[LOG_BUF_SIZE];
+
+int bpf_prog_load(enum bpf_prog_type prog_type,
+                  const struct bpf_insn *insns, int insn_cnt,
+                  const char *license)
+{
+    union bpf_attr attr = {
+        .prog_type = prog_type,
+        .insns = ptr_to_u64(insns),
+        .insn_cnt = insn_cnt,
+        .license = ptr_to_u64(license),
+        .log_buf = ptr_to_u64(bpf_log_buf),
+        .log_size = LOG_BUF_SIZE,
+        .log_level = 1,
+    };
+
+    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+.fi
+.B prog_type
+is one of the available program types:
+.nf
+enum bpf_prog_type {
+        BPF_PROG_TYPE_UNSPEC,
+        BPF_PROG_TYPE_SOCKET_FILTER,
+        BPF_PROG_TYPE_SCHED_CLS,
+};
+.fi
+By picking
+.IR prog_type ,
+the program author selects a set of helper functions callable from
+the program and the corresponding format of
+.I struct bpf_context
+(which is the data blob passed into the program as the first argument).
+For example, programs loaded with
+.I prog_type
+= BPF_PROG_TYPE_SOCKET_FILTER may call the bpf_map_lookup_elem() helper,
+whereas some future program types may not be allowed to.
+The set of functions available to programs of a given type may increase
+in the future.
+
+Currently the set of functions for
+.B BPF_PROG_TYPE_SOCKET_FILTER
+is:
+.nf
+bpf_map_lookup_elem(map_fd, void *key)              // lookup key in a map_fd
+bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
+bpf_map_delete_elem(map_fd, void *key)              // delete key in a map_fd
+.fi
+
+and bpf_context is a pointer to 'struct sk_buff'. Programs cannot
+access fields of 'sk_buff' directly.
+
+More program types may be added in the future, such as
+.BR BPF_PROG_TYPE_KPROBE ,
+for which bpf_context may be defined as a pointer to 'struct pt_regs'.
+
+.B insns
+array of "struct bpf_insn" instructions
+
+.B insn_cnt
+number of instructions in the program
+
+.B license
+license string, which must be GPL-compatible to call helper functions
+marked gpl_only
+
+.B log_buf
+user-supplied buffer that the in-kernel verifier uses to store the verification
+log. The log is a multi-line string that the program author can use to
+understand how the verifier came to the conclusion that the program is unsafe. The format
+of the output can change at any time as the verifier evolves.
+
+.B log_size
+size of the user buffer. If the buffer is not large enough to store all
+verifier messages, \-1 is returned and
+.I errno
+is set to ENOSPC.
+
+.B log_level
+verbosity level of the verifier, where zero means no log is provided
+.TP
+.B close(prog_fd)
+unloads the BPF program.
+.P
+Maps are accessible from programs and are used to exchange data between
+programs and between a program and user space.
+Programs process various events (such as kprobes and packets) and
+store data in maps. User space fetches the data from the maps.
+Either the same or a different map may be used by user space as configuration
+space to alter program behavior on the fly.
+.SS Events
+.P
+Once the program is loaded, it can be attached to an event. Various kernel
+subsystems have different ways to do so. For example:
+
+.nf
+setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
+.fi
+attaches the program
+.I prog_fd
+to the socket
+.IR sock ,
+which was returned by a prior call to socket().
+
+In the future
+.nf
+ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
+.fi
+may attach the program
+.I prog_fd
+to the perf event
+.IR event_fd ,
+which was returned by a prior call to perf_event_open().
+
+.SH EXAMPLES
+.nf
+/* bpf+sockets example:
+ * 1. create array map of 256 elements
+ * 2. load program that counts number of packets received
+ *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
+ *    map[r0]++
+ * 3. attach prog_fd to raw socket via setsockopt()
+ * 4. print number of received TCP/UDP packets every second
+ */
+int main(int ac, char **av)
+{
+    int sock, map_fd, prog_fd, key;
+    long long value = 0, tcp_cnt, udp_cnt;
+
+    map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 256);
+    if (map_fd < 0) {
+        printf("failed to create map '%s'\\n", strerror(errno));
+        /* likely not run as root */
+        return 1;
+    }
+
+    struct bpf_insn prog[] = {
+        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),           /* r6 = r1 */
+        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), /* r0 = ip->proto */
+        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),          /* r2 = fp */
+        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),         /* r2 = r2 - 4 */
+        BPF_LD_MAP_FD(BPF_REG_1, map_fd),              /* r1 = map_fd */
+        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),       /* r0 = map_lookup(r1, r2) */
+        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),         /* if (r0 == 0) goto pc+2 */
+        BPF_MOV64_IMM(BPF_REG_1, 1),                   /* r1 = 1 */
+        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),  /* lock *(u64 *)r0 += r1 */
+        BPF_MOV64_IMM(BPF_REG_0, 0),                   /* r0 = 0 */
+        BPF_EXIT_INSN(),                               /* return r0 */
+    };
+
+    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog,
+                            sizeof(prog) / sizeof(prog[0]), "GPL");
+    sock = open_raw_sock("lo");
+
+    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) == 0);
+
+    for (;;) {
+        key = IPPROTO_TCP;
+        assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
+        key = IPPROTO_UDP;
+        assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
+        printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt);
+        sleep(1);
+    }
+
+    return 0;
+}
+.fi
+.SH RETURN VALUE
+For a successful call, the return value depends on the operation:
+.TP
+.B BPF_MAP_CREATE
+The new file descriptor associated with the BPF map.
+.TP
+.B BPF_PROG_LOAD
+The new file descriptor associated with the BPF program.
+.TP
+All other commands
+Zero.
+.PP
+On error, \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EPERM
+The bpf() call was made without sufficient privilege
+(without the
+.B CAP_SYS_ADMIN
+capability).
+.TP
+.B ENOMEM
+Cannot allocate sufficient memory.
+.TP
+.B EBADF
+.I fd
+is not an open file descriptor.
+.TP
+.B EFAULT
+One of the pointers
+.RI ( key
+or
+.I value
+or
+.I log_buf
+or
+.IR insns )
+is outside the accessible address space.
+.TP
+.B EINVAL
+The value specified in
+.I cmd
+is not recognized by this kernel.
+.TP
+.B EINVAL
+For
+.BR BPF_MAP_CREATE ,
+either
+.I map_type
+or attributes are invalid.
+.TP
+.B EINVAL
+For
+.BR BPF_MAP_*_ELEM
+commands,
+some of the fields of "union bpf_attr" unused by this command are not set
+to zero.
+.TP
+.B EINVAL
+For
+.BR BPF_PROG_LOAD ,
+an attempt was made to load an invalid program (unrecognized instruction, use of
+reserved fields, jump out of range, loop detected, or call to an unknown function).
+.TP
+.BR EACCES
+For
+.BR BPF_PROG_LOAD ,
+although the program has valid instructions, it was rejected because it was deemed
+unsafe (it may access a disallowed memory region or an uninitialized stack/register,
+function constraints don't match actual types, or there is a misaligned access). In
+such a case it is recommended to call bpf() again with
+.I log_level = 1
+and examine
+.I log_buf
+for the specific reason provided by the verifier.
+.TP
+.BR ENOENT
+For
+.B BPF_MAP_LOOKUP_ELEM
+or
+.BR BPF_MAP_DELETE_ELEM ,
+indicates that the element with the given
+.I key
+was not found.
+.TP
+.BR E2BIG
+The program is too large, or
+a map reached the
+.I max_entries
+limit (maximum number of elements).
+.SH NOTES
+These commands may be used only by a privileged process (one having the
+.B CAP_SYS_ADMIN
+capability).
+.SH SEE ALSO
+Both classic and extended BPF are explained in the kernel source file Documentation/networking/filter.txt.
-- 
1.7.9.5
