On 3/12/23 12:05 PM, Joe Stringer wrote:
Extend the bpf hashmap docs to include a brief description of the
internals of the LRU map type (setting appropriate API expectations),
including the original commit message from Martin and a variant on the
graph that I had presented during my Linux Plumbers Conference 2022 talk
on "Pressure feedback for LRU map types"[0].
The node names in the dot file correspond roughly to the functions where
the logic for those decisions or steps is defined, to help curious
developers to cross-reference and update this logic if the details of
the LRU implementation ever differ from this description.
[0]: https://lpc.events/event/16/contributions/1368/
Signed-off-by: Joe Stringer <joe@xxxxxxxxxxxxx>
---
v3: Use standard table syntax
Replace inline commit message with reference to commit
Fix incorrect Y/N label for common LRU check
Rename some dotfile variables to reduce confusion between cases
Minor wording touchups
v2: Fix issue that caused initial email submission to fail
---
Documentation/bpf/map_hash.rst | 62 ++++++++
Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
2 files changed, 228 insertions(+)
create mode 100644 Documentation/bpf/map_lru_hash_update.dot
diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
index 8669426264c6..61602ce26561 100644
--- a/Documentation/bpf/map_hash.rst
+++ b/Documentation/bpf/map_hash.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0-only
.. Copyright (C) 2022 Red Hat, Inc.
+.. Copyright (C) 2022-2023 Isovalent, Inc.
===============================================
BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
@@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
cur_key = &next_key;
}
}
+
+Internals
+=========
+
+This section of the document is targeted at Linux developers and describes
+aspects of the map implementations that are not considered stable ABI. The
+following details are subject to change in future versions of the kernel.
+
+``BPF_MAP_TYPE_LRU_HASH`` and variants
+--------------------------------------
+
+An LRU hashmap type consists of two properties: Firstly, it is a hash map and
+hence is indexable by key for constant time lookups. Secondly, when at map
+capacity, map updates will trigger eviction of old entries based on the age of
+the elements in a set of lists. Each of these properties may be either global
+or per-CPU, depending on the map type and flags used to create the map:
+
++------------------------+---------------------------+----------------------------------+
+| | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
++========================+===========================+==================================+
+| ``BPF_NO_COMMON_LRU`` | Per-CPU LRU, global map | Per-CPU LRU, per-cpu map |
++------------------------+---------------------------+----------------------------------+
+| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map | Global LRU, per-cpu map |
++------------------------+---------------------------+----------------------------------+
+
+Notably, there are various steps that the update algorithm attempts in order to
+enforce the LRU property which have increasing impacts on other CPUs involved
+in the following operation attempts:
+
+- Attempt to use CPU-local state to batch operations
+- Attempt to fetch free nodes from global lists
+- Attempt to pull any node from a global list and remove it from the hashmap
+- Attempt to pull any node from any CPU's list and remove it from the hashmap
+
+Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
+may fail to insert the entry into the map if other CPUs are heavily contending
+on the global hashmap lock.
The global hashmap lock described here is the action taken in htab_lock_bucket()?
It is a percpu counter added in commit 20b6cc34ea74 ("bpf: Avoid hashtab
deadlock with map_locked") to avoid deadlock/recursion.
I would suggest to simplify the diagram by removing the "Can lock this hashtab
bucket?" details. May be a note somewhere to mention why it will still fail to
shrink the list because the htab_lock_bucket() have detected potential
deadlock/recursion which is a very unlikely case.
Thanks for the write-up!