Describe what eBPF maps are, how to create them, and how to interact with them. Thanks to Quentin Monnet <quentin.monnet@xxxxxxxxx> for improving this document by fixing many typos and early review. Signed-off-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx> --- Documentation/bpf/ebpf_maps.rst | 255 +++++++++++++++++++++++++++++++++++++++ Documentation/bpf/index.rst | 2 2 files changed, 257 insertions(+) create mode 100644 Documentation/bpf/ebpf_maps.rst diff --git a/Documentation/bpf/ebpf_maps.rst b/Documentation/bpf/ebpf_maps.rst new file mode 100644 index 000000000000..b8808f3bc31c --- /dev/null +++ b/Documentation/bpf/ebpf_maps.rst @@ -0,0 +1,255 @@ +========= +eBPF maps +========= + +This document describes what eBPF maps are, how you create them +(`Creating a map`_), and how to interact with them (`Interacting with +maps`_). + +Using eBPF maps is a method to keep state between invocations of the +eBPF program, and allows sharing data between eBPF kernel programs, +and also between kernel and user-space applications. + +Basically a key/value store with arbitrary structure (from man-page +`bpf(2)`_): + + eBPF maps are a generic data structure for storage of different data + types. Data types are generally treated as binary blobs, so a user + just specifies the size of the key and the size of the value at + map-creation time. In other words, a key/value for a given map can + have an arbitrary structure. + +The map handles are file descriptors, and multiple maps can be created +and accessed by multiple programs (from man-page `bpf(2)`_): + + A user process can create multiple maps (with key/value-pairs being + opaque bytes of data) and access them via file descriptors. + Different eBPF programs can access the same maps in parallel. It's + up to the user process and eBPF program to decide what they store + inside maps. + +.. _`Creating a map`: + +Creating a map +============== + +A map is created based on a request from userspace, via the `bpf`_ +syscall (specifically `bpf_cmd`_ BPF_MAP_CREATE), which returns a new +file descriptor that refers to the map. On error, -1 is returned and +errno is set to EINVAL, EPERM, or ENOMEM. These are the struct +``bpf_attr`` setup arguments to use when creating a map via the +syscall: + +.. code-block:: c + + bpf(BPF_MAP_CREATE, &bpf_attr, sizeof(bpf_attr)); + +Notice how this kernel ABI is extensible, as more struct arguments can +easily be added later as the sizeof(bpf_attr) is passed along to the +syscall. This also implies that API users must clear/zero +sizeof(bpf_attr), as compiler can size-align the struct differently, +to avoid garbage data to be interpreted as parameters by future +kernels. + +The following configuration attributes are needed when creating the map: + +.. code-block:: c + + union bpf_attr { + struct { /* anonymous struct used by BPF_MAP_CREATE command */ + __u32 map_type; /* one of enum bpf_map_type */ + __u32 key_size; /* size of key in bytes */ + __u32 value_size; /* size of value in bytes */ + __u32 max_entries; /* max number of entries in a map */ + __u32 map_flags; /* prealloc or not */ + }; + } + +.. _bpf_cmd: http://lxr.free-electrons.com/ident?i=bpf_cmd + + +Kernel sample/bpf ELF convention +-------------------------------- + +For programs under samples/bpf/, defining a map have been integrated +with ELF binary generated by LLVM. This is purely one example of a +userspace convention and not part of the kernel ABI. It still invokes +the bpf syscall. + +Map definitions are done by defining a ``struct bpf_map_def`` with an +elf section __attribute__ ``SEC("maps")``, in the xxx_kern.c file. +The maps file descriptor is available in the userspace xxx_user.c +file, via global array variable ``map_fd[]``, and the array map index +corresponds to the order the maps sections were defined in elf file of +xxx_kern.c file. Behind the scenes it is the ``load_bpf_file()`` call +(from `samples/bpf/bpf_load`_) that takes care of parsing ELF file +compiled by LLVM, pickup 'maps' section and creates maps via the bpf +syscall. + +.. code-block:: c + + struct bpf_map_def { + unsigned int type; + unsigned int key_size; + unsigned int value_size; + unsigned int max_entries; + unsigned int map_flags; + }; + + struct bpf_map_def SEC("maps") my_map = { + .type = BPF_MAP_TYPE_XXX, + .key_size = sizeof(u32), + .value_size = sizeof(u64), + .max_entries = 42, + .map_flags = 0 + }; + +.. section links + +.. _samples/bpf/bpf_load: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/bpf_load.c + +Qdisc Traffic Control convention +-------------------------------- + +It is worth mentioning, that qdisc TC (Traffic Control), also use ELF +files for defining the maps, but it uses another layout. See man-page +`tc-bpf(8)`_ and `tc bpf examples`_ in iproute2.git_ tree. + +.. _iproute2.git: + https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/about/ + +.. _tc bpf examples: + https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/tree/examples/bpf + +.. _tc-bpf(8): http://man7.org/linux/man-pages/man8/tc-bpf.8.html + + +Interacting with maps +===================== + +Interacting with eBPF maps happens through some **lookup/update/delete** +primitives. + +When writing eBFP programs using load helpers and libraries from +samples/bpf/ and tools/lib/bpf/. Common function name API have been +created that hides the details of how kernel vs. userspace access +these primitives (which is quite different). + +The common function names (parameters and return values differs): + +.. code-block:: c + + void bpf_map_lookup_elem(map, void *key. ...); + void bpf_map_update_elem(map, void *key, ..., __u64 flags); + void bpf_map_delete_elem(map, void *key); + +The ``flags`` argument in ``bpf_map_update_elem()`` allows to define +semantics on whether the element exists: + +.. code-block:: c + + /* File: include/uapi/linux/bpf.h */ + /* flags for BPF_MAP_UPDATE_ELEM command */ + #define BPF_ANY 0 /* create new element or update existing */ + #define BPF_NOEXIST 1 /* create new element only if it didn't exist */ + #define BPF_EXIST 2 /* only update existing element */ + +Userspace +--------- +The userspace API map helpers are defined in `tools/lib/bpf/bpf.h`_ +and looks like this: + +.. code-block:: c + + /* Userspace helpers */ + int bpf_map_lookup_elem(int fd, void *key, void *value); + int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags); + int bpf_map_delete_elem(int fd, void *key); + /* Only userspace: */ + int bpf_map_get_next_key(int fd, void *key, void *next_key); + + +Interacting with an eBPF map from **userspace**, happens through the +`bpf`_ syscall and a file descriptor. See how the map handle ``int +fd`` is a file descriptor . On success, zero is returned, on +failures -1 is returned and errno is set. + +Wrappers for the bpf syscall is implemented in `tools/lib/bpf/bpf.c`_, +and ends up calling functions in `kernel/bpf/syscall.c`_, like +`map_lookup_elem`_. + +.. code-block:: c + + /* Corresponding syscall bpf commands from userspace */ + enum bpf_cmd { + [...] + BPF_MAP_LOOKUP_ELEM, + BPF_MAP_UPDATE_ELEM, + BPF_MAP_DELETE_ELEM, + BPF_MAP_GET_NEXT_KEY, + [...] + }; + +Notice how ``void *key`` and ``void *value`` are passed as a void +pointers. Given the memory seperation between kernel and userspace, +this is a copy of the value. Kernel primitives like +``copy_from_user()`` and ``copy_to_user()`` are used, e.g. see +`map_lookup_elem`_, which also kmalloc+kfree memory for a short +period. + +From userspace, there is no function call to atomically increment or +decrement the value 'in-place'. The bpf_map_update_elem() call will +overwrite the existing value, with a copy of the value supplied. +Depending on the map type, the overwrite will happen in an atomic way, +e.g. using locking mechanisms specific to the map type. + +.. section links + +.. _tools/lib/bpf/bpf.h: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/bpf.h + +.. _tools/lib/bpf/bpf.c: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/bpf.c + +.. _map_lookup_elem: http://lxr.free-electrons.com/ident?i=map_lookup_elem + +.. _kernel/bpf/syscall.c: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/syscall.c + + +Kernel-side eBPF program +------------------------ + +The API mapping for eBPF programs on the kernel-side is fairly hard to +follow. It related to `samples/bpf/bpf_helpers.h`_ and maps into +`kernel/bpf/helpers.c`_ via macros. + +.. code-block:: c + + /* eBPF program helpers */ + void *bpf_map_lookup_elem(void *map, void *key); + int bpf_map_update_elem(void *map, void *key, void *value, unsigned long long flags); + int bpf_map_delete_elem(void *map, void *key); + +The eBPF-program running kernel-side interacts more directly with the +map data structures. For example the call ``bpf_map_lookup_elem()`` +returns a direct pointer to the 'value' memory-element inside the +kernel (while userspace gets a copy). This allows the eBPF-program to +atomically increment or decrement the value 'in-place', by using +appropiate compiler primitives like ``__sync_fetch_and_add()``, which +is understood by LLVM when generating eBPF instructions. + +.. section links + +.. _samples/bpf/bpf_helpers.h: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/bpf_helpers.h + +.. _kernel/bpf/helpers.c: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/helpers.c + +.. links + +.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html + +.. _bpf: http://man7.org/linux/man-pages/man2/bpf.2.html diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index f262fe8f9f95..738d17aaf6d7 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -42,6 +42,8 @@ covered by this documentation. .. toctree:: :maxdepth: 1 + ebpf_maps + .. links: .. _article 1992: http://www.tcpdump.org/papers/bpf-usenix93.pdf -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html