It is difficult to intuit the semantics of owning and non-owning references from verifier code. In order to keep the high-level details from being lost in the mailing list, this patch adds documentation explaining semantics and details. The target audience of doc added in this patch is folks working on BPF internals, as there's focus on "what should the verifier do here". Via reorganization or copy-and-paste, much of the content can probably be repurposed for BPF program writer audience as well. Signed-off-by: Dave Marchevsky <davemarchevsky@xxxxxx> --- Documentation/bpf/graph_ds_impl.rst | 208 ++++++++++++++++++++++++++++ Documentation/bpf/other.rst | 3 +- 2 files changed, 210 insertions(+), 1 deletion(-) create mode 100644 Documentation/bpf/graph_ds_impl.rst diff --git a/Documentation/bpf/graph_ds_impl.rst b/Documentation/bpf/graph_ds_impl.rst new file mode 100644 index 000000000000..f92cbd223dc3 --- /dev/null +++ b/Documentation/bpf/graph_ds_impl.rst @@ -0,0 +1,208 @@ +========================= +BPF Graph Data Structures +========================= + +This document describes implementation details of new-style "graph" data +structures (linked_list, rbtree), with particular focus on verifier +implementation of semantics particular to those data structures. + +Note that the intent of this document is to describe the current state of +these graph data structures, **no guarantees** of stability for either +semantics or APIs are made or implied here. + +.. contents:: + :local: + :depth: 2 + +Introduction +------------ + +The BPF map API has historically been the main way to expose data structures +of various types for use within BPF programs. Some data structures fit naturally +with the map API (HASH, ARRAY), others less so. Consequentially, programs +interacting with the latter group of data structures can be hard to parse +for kernel programmers without previous BPF experience. + +Luckily, some restrictions which necessitated the use of BPF map semantics are +no longer relevant. With the introduction of kfuncs, kptrs, and the any-context +BPF allocator, it is now possible to implement BPF data structures whose API +and semantics more closely match those exposed to the rest of the kernel. + +Two such data structures - linked_list and rbtree - have many verification +details in common. Because both have "root"s ("head" for linked_list) and +"node"s, the verifier code and this document refer to common functionality +as "graph_api", "graph_root", "graph_node", etc. + +Unless otherwise stated, examples and semantics below apply to both graph data +structures. + +Non-owning references +--------------------- + +**Motivation** + +Consider the following BPF code: + +.. code-block:: c + struct node_data *n = bpf_obj_new(typeof(*n)); /* BEFORE */ + + bpf_spin_lock(&lock); + + bpf_rbtree_add(&tree, n); /* AFTER */ + + bpf_spin_unlock(&lock); +---- + +From the verifier's perspective, after bpf_obj_new ``n`` has type +``PTR_TO_BTF_ID | MEM_ALLOC`` with btf_id of ``struct node_data`` and a +nonzero ``ref_obj_id``. Because it holds ``n``, the program has ownership +of the pointee's lifetime (object pointed to by ``n``). The BPF program must +pass off ownership before exiting - either via ``bpf_obj_drop``, which free's +the object, or by adding it to ``tree`` with ``bpf_rbtree_add``. + +(``BEFORE`` and ``AFTER`` comments in the example denote beginning of "before +ownership is passed" and "after ownership is passed") + +What should the verifier do with ``n`` after ownership is passed off? If the +object was free'd with ``bpf_obj_drop`` the answer is obvious: the verifier +should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as +the object is no longer valid. The underlying memory may have been reused for +some other allocation, unmapped, etc. + +When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less +obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, +but that would result in programs with useful, common coding patterns being +rejected, e.g.: + +.. code-block:: c + int x; + struct node_data *n = bpf_obj_new(typeof(*n)); /* BEFORE */ + + bpf_spin_lock(&lock); + + bpf_rbtree_add(&tree, n); /* AFTER */ + x = n->data; + n->data = 42; + + bpf_spin_unlock(&lock); +---- + +Both the read from and write to ``n->data`` would be rejected. The verifier +can do better, though, by taking advantage of two details: + + * Graph data structure APIs can only be used when the ``bpf_spin_lock`` + associated with the graph root is held + * Both graph data structures have pointer stability + * Because graph nodes are allocated with ``bpf_obj_new`` and + adding / removing from the root involves fiddling with the + ``bpf_{list,rb}_node`` field of the node struct, a graph node will + remain at the same address after either operation. + +Because the associated ``bpf_spin_lock`` must be held by any program adding +or removing, if we're in the critical section bounded by that lock, we know +that no other program can add or remove until the end of the critical section. +This combined with pointer stability means that, until the critical section +ends, we can safely access the graph node through ``n`` even after it was used +to pass ownership. + +The verifier considers such a reference a *non-owning reference*. The ref +returned by ``bpf_obj_new`` is accordingly considered an *owning reference*. +Both terms currently only have meaning in the context of graph nodes and API. + +**Details** + +Let's enumerate the properties of both types of references. + +*owning reference* + + * This reference controls the lifetime of the pointee + * Ownership of pointee must be 'released' by passing it to some graph API + kfunc, or via ``bpf_obj_drop``, which free's the pointee + * If not released before program ends, verifier considers program invalid + * Access to the pointee's memory will not page fault + +*non-owning reference* + + * This reference does not own the pointee + * It cannot be used to add the graph node to a graph root, nor free via + ``bpf_obj_drop`` + * No explicit control of lifetime, but can infer valid lifetime based on + non-owning ref existence (see explanation below) + * Access to the pointee's memory will not page fault + +From verifier's perspective non-owning references can only exist +between spin_lock and spin_unlock. Why? After spin_unlock another program +can do arbitrary operations on the data structure like removing and free-ing +via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, +free'd, and reused via bpf_obj_new would point to an entirely different thing. +Or the memory could go away. + +To prevent this logic violation all non-owning references are invalidated by +verifier after critical section ends. This is necessary to ensure "will +not page fault" property of non-owning reference. So if verifier hasn't +invalidated a non-owning ref, accessing it will not page fault. + +Currently ``bpf_obj_drop`` is not allowed in the critical section, so +if there's a valid non-owning ref, we must be in critical section, and can +conclude that the ref's memory hasn't been dropped-and-free'd or dropped- +and-reused. + +Any reference to a node that is in a rbtree _must_ be non-owning, since +the tree has control of pointee lifetime. Similarly, any ref to a node +that isn't in rbtree _must_ be owning. This results in a nice property: +graph API add / remove implementations don't need to check if a node +has already been added (or already removed), as the verifier type system +prevents such a state from being valid. + +However, pointer aliasing poses an issue for the above "nice property". +Consider the following example: + +.. code-block:: c + struct node_data *n, *m, *o, *p; + n = bpf_obj_new(typeof(*n)); /* 1 */ + + bpf_spin_lock(&lock); + + bpf_rbtree_add(&tree, n); /* 2 */ + m = bpf_rbtree_first(&tree); /* 3 */ + + o = bpf_rbtree_remove(&tree, n); /* 4 */ + p = bpf_rbtree_remove(&tree, m); /* 5 */ + + bpf_spin_unlock(&lock); + + bpf_obj_drop(o); + bpf_obj_drop(p); /* 6 */ +---- + +Assume tree is empty before this program runs. If we track verifier state +changes here using numbers in above comments: + + 1) n is an owning reference + 2) n is a non-owning reference, it's been added to the tree + 3) n and m are non-owning references, they both point to the same node + 4) o is an owning reference, n and m non-owning, all point to same node + 5) o and p are owning, n and m non-owning, all point to the same node + 6) a double-free has occurred, since o and p point to same node and o was + free'd in previous statement + +States 4 and 5 violate our "nice property", as there are non-owning refs to +a node which is not in a rbtree. Statement 5 will try to remove a node which +has already been removed as a result of this violation. State 6 is a dangerous +double-free. + +At a minimum we should prevent state 6 from being possible. If we can't also +prevent state 5 then we must abandon our "nice property" and check whether a +node has already been removed at runtime. + +We prevent both by generalizing the "invalidate non-owning references" behavior +of ``bpf_spin_unlock`` and doing similar invalidation after +``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: + + * takes an arbitrary node argument + * removes it from the datastructure + * returns an owning reference to the removed node + +May result in a state where some other non-owning reference points to the same +node. So ``remove``-type kfuncs must be considered a non-owning reference +invalidation point as well. diff --git a/Documentation/bpf/other.rst b/Documentation/bpf/other.rst index 3d61963403b4..7e6b12018802 100644 --- a/Documentation/bpf/other.rst +++ b/Documentation/bpf/other.rst @@ -6,4 +6,5 @@ Other :maxdepth: 1 ringbuf - llvm_reloc \ No newline at end of file + llvm_reloc + graph_ds_impl -- 2.30.2