Re: [PATCH man-pages] bpf.2: new page documenting bpf(2)

"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> · Tue, 10 Mar 2015 06:50:16 +0100

Hi Alexei,

The page needs a license. See 
https://www.kernel.org/doc/man-pages/licenses.html
for some possible choices.

Thanks,

Michael

On 03/09/2015 11:10 PM, Alexei Starovoitov wrote:
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
> ---
>  man2/bpf.2 |  593 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 593 insertions(+)
>  create mode 100644 man2/bpf.2
> 
> diff --git a/man2/bpf.2 b/man2/bpf.2
> new file mode 100644
> index 0000000..21b42b4
> --- /dev/null
> +++ b/man2/bpf.2
> @@ -0,0 +1,593 @@
> +.TH BPF 2 2015-03-09 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +bpf - perform a command on extended BPF map or program
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/bpf.h>
> +.sp
> +.BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size );
> +
> +.SH DESCRIPTION
> +.BR bpf ()
> +is a multiplexor system call for a range of operations on extended BPF,
> +which can be characterized as a "universal in-kernel virtual machine".
> +Extended BPF (or eBPF) is similar to the original Berkeley Packet Filter
> +(or "classic BPF") used to filter network packets. Both statically analyze
> +programs before loading them into the kernel to ensure that they cannot
> +harm the running system.
> +.P
> +eBPF extends classic BPF in multiple ways, including the ability to call
> +in-kernel helper functions and access shared data structures such as BPF maps.
> +Programs can be written in a restricted C that is compiled into eBPF
> +bytecode, which is executed on the in-kernel virtual machine or JIT-compiled
> +into the native instruction set.
> +.SS Extended BPF Design/Architecture
> +.P
> +BPF maps are a generic storage facility of different types.
> +A user process can create multiple maps (with key/value pairs being
> +opaque bytes of data) and access them via file descriptors. In parallel, BPF
> +programs can access maps from inside the kernel.
> +It is up to the user process and the BPF program to decide what they store in maps.
> +.P
> +BPF programs are similar to kernel modules. They are loaded by the user
> +process and automatically unloaded when the process exits. Each BPF program is
> +a safe run-to-completion set of instructions. The BPF verifier statically
> +determines that the program terminates and is safe to execute. During
> +verification the program takes a hold of the maps that it intends to use,
> +so the selected maps cannot be removed until the program is unloaded. The program
> +can be attached to different events. These events can be packets, tracing
> +events, and other types in the future. A new event triggers execution of
> +the program, which may store information about the event in the maps.
> +Beyond storing data, programs may call in-kernel helper functions.
> +The same program can be attached to multiple events. Different programs can
> +access the same map:
> +.nf
> +  tracing     tracing     tracing     packet     packet
> +  event A     event B     event C     on eth0    on eth1
> +   |             |          |           |          |
> +   |             |          |           |          |
> +   --> tracing <--      tracing       socket     socket
> +        prog_1           prog_2       prog_3     prog_4
> +        |  |               |            |
> +     |---  -----|  |-------|           map_3
> +   map_1       map_2
> +.fi
> +.SS Syscall Arguments
> +.BR bpf ()
> +performs the operation selected by
> +.IR cmd ,
> +which can be one of the following:
> +.TP
> +.B BPF_MAP_CREATE
> +Create a map with the given type and attributes and return a map file descriptor.
> +.TP
> +.B BPF_MAP_LOOKUP_ELEM
> +Look up an element by key in a given map and return its value.
> +.TP
> +.B BPF_MAP_UPDATE_ELEM
> +Create or update an element (key/value pair) in a given map.
> +.TP
> +.B BPF_MAP_DELETE_ELEM
> +Look up and delete an element by key in a given map.
> +.TP
> +.B BPF_MAP_GET_NEXT_KEY
> +Look up an element by key in a given map and return the key of the next element.
> +.TP
> +.B BPF_PROG_LOAD
> +Verify and load a BPF program.
> +.TP
> +.B attr
> +is a pointer to a union of type bpf_attr as defined below.
> +.TP
> +.B size
> +is the size of the union.
> +.P
> +.nf
> +union bpf_attr {
> +    struct { /* anonymous struct used by BPF_MAP_CREATE command */
> +        __u32             map_type;
> +        __u32             key_size;    /* size of key in bytes */
> +        __u32             value_size;  /* size of value in bytes */
> +        __u32             max_entries; /* max number of entries in a map */
> +    };
> +
> +    struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> +        __u32             map_fd;
> +        __aligned_u64     key;
> +        union {
> +            __aligned_u64 value;
> +            __aligned_u64 next_key;
> +        };
> +        __u64             flags;
> +    };
> +
> +    struct { /* anonymous struct used by BPF_PROG_LOAD command */
> +        __u32         prog_type;
> +        __u32         insn_cnt;
> +        __aligned_u64 insns;     /* 'const struct bpf_insn *' */
> +        __aligned_u64 license;   /* 'const char *' */
> +        __u32         log_level; /* verbosity level of verifier */
> +        __u32         log_size;  /* size of user buffer */
> +        __aligned_u64 log_buf;   /* user supplied 'char *' buffer */
> +    };
> +} __attribute__((aligned(8)));
> +.fi
> +.SS BPF maps
> +Maps are a generic storage facility of different types for sharing data between
> +the kernel and user space.
> +
> +Any map type has the following attributes:
> +  . type
> +  . max number of elements
> +  . key size in bytes
> +  . value size in bytes
> +
> +The following wrapper functions demonstrate how this syscall can be used to
> +access the maps. The functions use the
> +.IR cmd
> +argument to invoke different operations.
> +.TP
> +.B BPF_MAP_CREATE
> +.nf
> +int bpf_create_map(enum bpf_map_type map_type, int key_size,
> +                   int value_size, int max_entries)
> +{
> +    union bpf_attr attr = {
> +        .map_type = map_type,
> +        .key_size = key_size,
> +        .value_size = value_size,
> +        .max_entries = max_entries
> +    };
> +
> +    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
> +}
> +.fi
> +The call creates a map of type
> +.I map_type
> +with the given attributes
> +.IR key_size ", " value_size ", and " max_entries .
> +On success it returns a process-local file descriptor. On error, \-1 is returned and
> +.I errno
> +is set to EINVAL, EPERM, or ENOMEM.
> +
> +The attributes
> +.I key_size
> +and
> +.I value_size
> +are used by the verifier during program loading to check that the program calls
> +bpf_map_*_elem() helper functions with a correctly initialized
> +.I key
> +and that the program does not access the map element
> +.I value
> +beyond the specified
> +.IR value_size .
> +For example, when a map is created with key_size = 8 and the program does:
> +.nf
> +bpf_map_lookup_elem(map_fd, fp - 4)
> +.fi
> +the program will be rejected,
> +since the in-kernel helper function bpf_map_lookup_elem(map_fd, void *key) expects
> +to read 8 bytes from the 'key' pointer, but the starting address 'fp - 4' would cause
> +an out-of-bounds stack access.
> +
> +Similarly, when a map is created with value_size = 1 and the program does:
> +.nf
> +value = bpf_map_lookup_elem(...);
> +*(u32 *)value = 1;
> +.fi
> +the program will be rejected, since it accesses the
> +.I value
> +pointer beyond the specified 1-byte value_size limit.
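
Both rejections described above reduce to a range check on an (offset, size)
pair. A minimal sketch of that reasoning in plain C follows. It is not the
verifier's actual code: the function names are invented for illustration, and
the 512-byte stack limit is the kernel's MAX_BPF_STACK, which the page does
not state.

```c
#include <assert.h>
#include <stdbool.h>

#define STACK_SIZE 512  /* assumption: eBPF stack size (MAX_BPF_STACK) */

/* A stack access of 'size' bytes at fp + off must lie in [fp - 512, fp). */
bool stack_access_ok(int off, int size)
{
    return size > 0 && off >= -STACK_SIZE && off + size <= 0;
}

/* An access of 'size' bytes at value + off must lie in [0, value_size). */
bool value_access_ok(int off, int size, int value_size)
{
    return size > 0 && off >= 0 && off + size <= value_size;
}

/* The two cases from the text, expressed with the checks above. */
void check_examples(void)
{
    assert(!stack_access_ok(-4, 8));   /* key_size = 8 read at fp - 4 */
    assert(!value_access_ok(0, 4, 1)); /* u32 store into a 1-byte value */
}
```

With the key placed at fp - 8 instead, the 8-byte read ends exactly at fp and
the first check passes.
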
> +
> +Currently two values of
> +.I map_type
> +are supported:
> +.nf
> +enum bpf_map_type {
> +   BPF_MAP_TYPE_UNSPEC,
> +   BPF_MAP_TYPE_HASH,
> +   BPF_MAP_TYPE_ARRAY,
> +};
> +.fi
> +.I map_type
> +selects one of the available map implementations in the kernel. For all map types,
> +programs access maps with the same bpf_map_lookup_elem()/bpf_map_update_elem()
> +helper functions.
> +.TP
> +.B BPF_MAP_LOOKUP_ELEM
> +.nf
> +int bpf_lookup_elem(int fd, void *key, void *value)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +        .value = ptr_to_u64(value),
> +    };
> +
> +    return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
> +}
> +.fi
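
The wrappers in this section call bpf() and ptr_to_u64(), but the page
defines neither, and glibc ships no bpf() wrapper. A minimal sketch of the two
helpers follows, assuming kernel headers that provide <linux/bpf.h> and
__NR_bpf. The definitions the author has in mind may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Widen a user-space pointer to the 64-bit fields of union bpf_attr. */
static inline uint64_t ptr_to_u64(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr;
}

/* There is no libc wrapper, so invoke the system call through syscall(2). */
static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}
```

Going through uintptr_t keeps the conversion well defined on 32-bit systems,
where pointers are narrower than the __aligned_u64 fields.
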
> +The call looks up an element with the given
> +.I key
> +in the map
> +.IR fd .
> +If the element is found, it returns zero and stores the element's value into
> +.IR value .
> +If the element is not found, it returns \-1 and sets
> +.I errno
> +to ENOENT.
> +.TP
> +.B BPF_MAP_UPDATE_ELEM
> +.nf
> +int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +        .value = ptr_to_u64(value),
> +        .flags = flags,
> +    };
> +
> +    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The call creates or updates an element with the given
> +.IR key / value
> +in the map
> +.I fd
> +according to
> +.IR flags ,
> +which can have one of three values:
> +.nf
> +#define BPF_ANY         0 /* create new element or update existing */
> +#define BPF_NOEXIST     1 /* create new element if it didn't exist */
> +#define BPF_EXIST       2 /* update existing element */
> +.fi
> +On success it returns zero.
> +On error, \-1 is returned and
> +.I errno
> +is set to EINVAL, EPERM, ENOMEM, E2BIG, EEXIST, or ENOENT.
> +.B E2BIG
> +indicates that the number of elements in the map reached the
> +.I max_entries
> +limit specified at map creation time.
> +.B EEXIST
> +is returned from the call bpf_update_elem(fd, key, value, BPF_NOEXIST) if an element
> +with 'key' already exists in the map.
> +.B ENOENT
> +is returned from the call bpf_update_elem(fd, key, value, BPF_EXIST) if no element
> +with 'key' exists in the map.
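
The interaction between flags and the state of the map can be summarized as a
small decision function. This is a user-space model for illustration only,
not kernel code: update_result() is a made-up name, the order of the checks is
an assumption, and so is EINVAL for flag values above BPF_EXIST.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag values as listed in the page. */
#define BPF_ANY     0
#define BPF_NOEXIST 1
#define BPF_EXIST   2

/* Hypothetical model: 0 on success, otherwise the errno value described. */
int update_result(bool key_exists, bool map_full, unsigned long long flags)
{
    if (flags > BPF_EXIST)
        return EINVAL;   /* assumption: unknown flag value */
    if (key_exists && flags == BPF_NOEXIST)
        return EEXIST;   /* asked to create, but the key is present */
    if (!key_exists && flags == BPF_EXIST)
        return ENOENT;   /* asked to update, but the key is absent */
    if (!key_exists && map_full)
        return E2BIG;    /* a new element would exceed max_entries */
    return 0;
}
```

Updating an existing key never grows the map, so in this model it succeeds
even when the map is full.
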
> +.TP
> +.B BPF_MAP_DELETE_ELEM
> +.nf
> +int bpf_delete_elem(int fd, void *key)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +    };
> +
> +    return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
> +}
> +.fi
> +The call deletes an element in the map
> +.I fd
> +with the given
> +.IR key .
> +It returns zero on success. If the element is not found, it returns \-1 and sets
> +.I errno
> +to ENOENT.
> +.TP
> +.B BPF_MAP_GET_NEXT_KEY
> +.nf
> +int bpf_get_next_key(int fd, void *key, void *next_key)
> +{
> +    union bpf_attr attr = {
> +        .map_fd = fd,
> +        .key = ptr_to_u64(key),
> +        .next_key = ptr_to_u64(next_key),
> +    };
> +
> +    return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
> +}
> +.fi
> +The call looks up an element by
> +.I key
> +in the given map
> +.I fd
> +and stores the key of the next element in the buffer that
> +.I next_key
> +points to. If
> +.I key
> +is not found, it returns zero and stores the key of the first element in
> +.IR next_key ". If"
> +.I key
> +is the last element, it returns \-1 and sets
> +.I errno
> +to ENOENT. Other possible
> +.I errno
> +values are ENOMEM, EFAULT, EPERM, and EINVAL.
> +This method can be used to iterate over all elements of the map.
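
The iteration mentioned in the last sentence could look like the sketch
below. Several things here are assumptions rather than part of the page:
dump_keys() is a hypothetical helper, the map is assumed to have 4-byte keys,
and the starting key UINT32_MAX is assumed to be absent from the map so that
the first call yields the first key. The wrapper zeroes the whole union with
memset() before filling it in, since the ERRORS section reports EINVAL when
unused fields are not zero.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Same shape as the page's wrapper, but zeroing the whole union first. */
static int bpf_get_next_key(int fd, void *key, void *next_key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = (uint64_t)(uintptr_t)key;
    attr.next_key = (uint64_t)(uintptr_t)next_key;

    return syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
}

/* Hypothetical helper: print every key of a map that has 4-byte keys. */
int dump_keys(int fd)
{
    uint32_t key = UINT32_MAX;  /* assumption: this key is not in the map */
    uint32_t next;
    int count = 0;

    while (bpf_get_next_key(fd, &key, &next) == 0) {
        printf("key %u\n", (unsigned int)next);
        key = next;             /* continue from the key just returned */
        count++;
    }
    return count;  /* the loop stops on ENOENT (last key) or any other error */
}
```

A caller that needs to tell the normal end of iteration from a failure would
check errno for ENOENT after the loop.
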
> +.TP
> +.B close(map_fd)
> +will delete the map
> +.IR map_fd .
> +When the process exits, all of its maps are deleted automatically.
> +.P
> +.SS BPF programs
> +
> +.TP
> +.B BPF_PROG_LOAD
> +This
> +.I cmd
> +is used to load an extended BPF program into the kernel.
> +
> +.nf
> +char bpf_log_buf[LOG_BUF_SIZE];
> +
> +int bpf_prog_load(enum bpf_prog_type prog_type,
> +                  const struct bpf_insn *insns, int insn_cnt,
> +                  const char *license)
> +{
> +    union bpf_attr attr = {
> +        .prog_type = prog_type,
> +        .insns = ptr_to_u64(insns),
> +        .insn_cnt = insn_cnt,
> +        .license = ptr_to_u64(license),
> +        .log_buf = ptr_to_u64(bpf_log_buf),
> +        .log_size = LOG_BUF_SIZE,
> +        .log_level = 1,
> +    };
> +
> +    return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
> +}
> +.fi
> +.B prog_type
> +is one of the available program types:
> +.nf
> +enum bpf_prog_type {
> +        BPF_PROG_TYPE_UNSPEC,
> +        BPF_PROG_TYPE_SOCKET_FILTER,
> +        BPF_PROG_TYPE_SCHED_CLS,
> +};
> +.fi
> +By picking
> +.IR prog_type ,
> +the program author selects the set of helper functions callable from
> +the program and the corresponding format of
> +.I struct bpf_context
> +(which is the data blob passed into the program as the first argument).
> +For example, programs loaded with
> +.I prog_type
> += BPF_PROG_TYPE_SOCKET_FILTER may call the bpf_map_lookup_elem() helper,
> +whereas programs of some future types may not be allowed to.
> +The set of functions available to programs of a given type may grow
> +in the future.
> +
> +Currently the set of functions for
> +.B BPF_PROG_TYPE_SOCKET_FILTER
> +is:
> +.nf
> +bpf_map_lookup_elem(map_fd, void *key)              // lookup key in a map_fd
> +bpf_map_update_elem(map_fd, void *key, void *value) // update key/value
> +bpf_map_delete_elem(map_fd, void *key)              // delete key in a map_fd
> +.fi
> +
> +and bpf_context is a pointer to 'struct sk_buff'. Programs cannot
> +access fields of 'sk_buff' directly.
> +
> +More program types may be added in the future, such as
> +.BR BPF_PROG_TYPE_KPROBE ,
> +whose bpf_context may be defined as a pointer to 'struct pt_regs'.
> +
> +.B insns
> +is an array of "struct bpf_insn" instructions.
> +
> +.B insn_cnt
> +is the number of instructions in the program.
> +
> +.B license
> +is a license string, which must be GPL compatible to call helper functions
> +marked gpl_only.
> +
> +.B log_buf
> +is a user-supplied buffer in which the in-kernel verifier stores the verification
> +log. The log is a multi-line string that the program author can use to
> +understand how the verifier concluded that the program is unsafe. The format
> +of the output can change at any time as the verifier evolves.
> +
> +.B log_size
> +is the size of the user buffer. If the buffer is not large enough to store all
> +verifier messages, \-1 is returned and
> +.I errno
> +is set to ENOSPC.
> +
> +.B log_level
> +is the verbosity level of the verifier, where zero means no log is provided.
> +.TP
> +.B close(prog_fd)
> +will unload the BPF program.
> +.P
> +Maps are accessible from programs and are used to exchange data between
> +programs and between a program and user space.
> +Programs process various events (such as kprobes and packets) and
> +store data in maps. User space fetches data from maps.
> +Either the same or a different map may be used by user space as configuration
> +space to alter program behavior on the fly.
> +.SS Events
> +.P
> +Once the program is loaded, it can be attached to an event. Various kernel
> +subsystems have different ways to do so. For example:
> +
> +.nf
> +setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
> +.fi
> +will attach the program
> +.I prog_fd
> +to the socket
> +.IR sock ,
> +which was returned by a prior call to socket().
> +
> +In the future,
> +.nf
> +ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
> +.fi
> +may attach the program
> +.I prog_fd
> +to the perf event
> +.IR event_fd ,
> +which was returned by a prior call to perf_event_open().
> +
> +.SH EXAMPLES
> +.nf
> +/* bpf+sockets example:
> + * 1. create array map of 256 elements
> + * 2. load program that counts number of packets received
> + *    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)]
> + *    map[r0]++
> + * 3. attach prog_fd to raw socket via setsockopt()
> + * 4. print number of received TCP/UDP packets every second
> + */
> +int main(int ac, char **av)
> +{
> +    int sock, map_fd, prog_fd, key;
> +    long long value = 0, tcp_cnt, udp_cnt;
> +
> +    map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 256);
> +    if (map_fd < 0) {
> +        printf("failed to create map '%s'\\n", strerror(errno));
> +        /* likely not run as root */
> +        return 1;
> +    }
> +
> +    struct bpf_insn prog[] = {
> +        BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),           /* r6 = r1 */
> +        BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), /* r0 = ip->proto */
> +        BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
> +        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),          /* r2 = fp */
> +        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),         /* r2 = r2 - 4 */
> +        BPF_LD_MAP_FD(BPF_REG_1, map_fd),              /* r1 = map_fd */
> +        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),       /* r0 = map_lookup(r1, r2) */
> +        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),         /* if (r0 == 0) goto pc+2 */
> +        BPF_MOV64_IMM(BPF_REG_1, 1),                   /* r1 = 1 */
> +        BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0),  /* lock *(u64 *)r0 += r1 */
> +        BPF_MOV64_IMM(BPF_REG_0, 0),                   /* r0 = 0 */
> +        BPF_EXIT_INSN(),                               /* return r0 */
> +    };
> +
> +    prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog) / sizeof(prog[0]), "GPL");
> +
> +    sock = open_raw_sock("lo");
> +
> +    assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd)) == 0);
> +
> +    for (;;) {
> +        key = IPPROTO_TCP;
> +        assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
> +        key = IPPROTO_UDP;
> +        assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
> +        printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt);
> +        sleep(1);
> +    }
> +
> +    return 0;
> +}
> +.fi
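
The example calls open_raw_sock(), which the page does not define. One
plausible shape for it is sketched below, assuming it is meant to return a
packet socket bound to the named interface. The body is a guess at the intent
and not necessarily the author's helper.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

/* Hypothetical helper: raw packet socket bound to interface 'name', or -1. */
int open_raw_sock(const char *name)
{
    struct sockaddr_ll sll;
    unsigned int ifindex = if_nametoindex(name);
    int sock;

    if (ifindex == 0)
        return -1;              /* no such interface */

    sock = socket(PF_PACKET, SOCK_RAW | SOCK_CLOEXEC, htons(ETH_P_ALL));
    if (sock < 0)
        return -1;              /* typically EPERM without CAP_NET_RAW */

    memset(&sll, 0, sizeof(sll));
    sll.sll_family = AF_PACKET;
    sll.sll_ifindex = ifindex;
    sll.sll_protocol = htons(ETH_P_ALL);
    if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```

Since the helper can fail, main() would need to check its return value before
passing the descriptor to setsockopt().
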
> +.SH RETURN VALUE
> +For a successful call, the return value depends on the operation:
> +.TP
> +.B BPF_MAP_CREATE
> +The new file descriptor associated with the BPF map.
> +.TP
> +.B BPF_PROG_LOAD
> +The new file descriptor associated with the BPF program.
> +.TP
> +All other commands
> +Zero.
> +.PP
> +On error, \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +.TP
> +.B EPERM
> +The call was made without sufficient privilege
> +(without the
> +.B CAP_SYS_ADMIN
> +capability).
> +.TP
> +.B ENOMEM
> +Cannot allocate sufficient memory.
> +.TP
> +.B EBADF
> +.I fd
> +is not an open file descriptor.
> +.TP
> +.B EFAULT
> +One of the pointers
> +.RI ( key
> +or
> +.I value
> +or
> +.I log_buf
> +or
> +.IR insns )
> +is outside the accessible address space.
> +.TP
> +.B EINVAL
> +The value specified in
> +.I cmd
> +is not recognized by this kernel.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_MAP_CREATE ,
> +either
> +.I map_type
> +or attributes are invalid.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_MAP_*_ELEM
> +commands,
> +some of the fields of "union bpf_attr" unused by this command are not set
> +to zero.
> +.TP
> +.B EINVAL
> +For
> +.BR BPF_PROG_LOAD ,
> +an attempt was made to load an invalid program (unrecognized instruction, use of
> +reserved fields, jump out of range, loop detected, or call to an unknown function).
> +.TP
> +.B EACCES
> +For
> +.BR BPF_PROG_LOAD ,
> +though the program has valid instructions, it was rejected because it was deemed
> +unsafe (it may access a disallowed memory region or an uninitialized stack/register,
> +function constraints don't match actual types, or there is a misaligned access). In
> +such a case it is recommended to call bpf() again with
> +.I log_level = 1
> +and examine
> +.I log_buf
> +for the specific reason provided by the verifier.
> +.TP
> +.B ENOENT
> +For
> +.B BPF_MAP_LOOKUP_ELEM
> +or
> +.BR BPF_MAP_DELETE_ELEM ,
> +indicates that the element with the given
> +.I key
> +was not found.
> +.TP
> +.B E2BIG
> +The program is too large or
> +a map reached the
> +.I max_entries
> +limit (maximum number of elements).
> +.SH NOTES
> +These commands may be used only by a privileged process (one having the
> +.B CAP_SYS_ADMIN
> +capability).
> +.SH SEE ALSO
> +Both classic and extended BPF are explained in Documentation/networking/filter.txt
> 

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/