Re: [PATCH bpf-next v3] docs/bpf: Add LRU internals description and graph

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Mar 12, 2023 at 12:05:59PM -0700, Joe Stringer wrote:
> Extend the bpf hashmap docs to include a brief description of the
> internals of the LRU map type (setting appropriate API expectations),
> including the original commit message from Martin and a variant on the
> graph that I had presented during my Linux Plumbers Conference 2022 talk
> on "Pressure feedback for LRU map types"[0].
> 
> The node names in the dot file correspond roughly to the functions where
> the logic for those decisions or steps is defined, to help curious
> developers to cross-reference and update this logic if the details of
> the LRU implementation ever differ from this description.
> 
> [0]: https://lpc.events/event/16/contributions/1368/
> 
> Signed-off-by: Joe Stringer <joe@xxxxxxxxxxxxx>
> ---
> v3: Use standard table syntax
>     Replace inline commit message with reference to commit
>     Fix incorrect Y/N label for common LRU check
>     Rename some dotfile variables to reduce confusion between cases
>     Minor wording touchups
> v2: Fix issue that caused initial email submission to fail
> ---
>  Documentation/bpf/map_hash.rst            |  62 ++++++++
>  Documentation/bpf/map_lru_hash_update.dot | 166 ++++++++++++++++++++++
>  2 files changed, 228 insertions(+)
>  create mode 100644 Documentation/bpf/map_lru_hash_update.dot
> 
> diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
> index 8669426264c6..61602ce26561 100644
> --- a/Documentation/bpf/map_hash.rst
> +++ b/Documentation/bpf/map_hash.rst
> @@ -1,5 +1,6 @@
>  .. SPDX-License-Identifier: GPL-2.0-only
>  .. Copyright (C) 2022 Red Hat, Inc.
> +.. Copyright (C) 2022-2023 Isovalent, Inc.
>  
>  ===============================================
>  BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
> @@ -206,3 +207,64 @@ Userspace walking the map elements from the map declared above:
>                      cur_key = &next_key;
>              }
>      }
> +
> +Internals
> +=========
> +
> +This section of the document is targeted at Linux developers and describes
> +aspects of the map implementations that are not considered stable ABI. The
> +following details are subject to change in future versions of the kernel.
> +
> +``BPF_MAP_TYPE_LRU_HASH`` and variants
> +--------------------------------------
> +
> +An LRU hashmap type consists of two properties: Firstly, it is a hash map and
> +hence is indexable by key for constant time lookups. Secondly, when at map
> +capacity, map updates will trigger eviction of old entries based on the age of
> +the elements in a set of lists. Each of these properties may be either global
> +or per-CPU, depending on the map type and flags used to create the map:
> +
> ++------------------------+---------------------------+----------------------------------+
> +|                        | ``BPF_MAP_TYPE_LRU_HASH`` | ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` |
> ++========================+===========================+==================================+
> +| ``BPF_NO_COMMON_LRU``  | Per-CPU LRU, global map   | Per-CPU LRU, per-cpu map         |
> ++------------------------+---------------------------+----------------------------------+
> +| ``!BPF_NO_COMMON_LRU`` | Global LRU, global map    | Global LRU, per-cpu map          |
> ++------------------------+---------------------------+----------------------------------+
> +
> +Notably, there are various steps that the update algorithm attempts in order to
> +enforce the LRU property which have increasing impacts on other CPUs involved
> +in the following operation attempts:
> +
> +- Attempt to use CPU-local state to batch operations
> +- Attempt to fetch free nodes from global lists
> +- Attempt to pull any node from a global list and remove it from the hashmap
> +- Attempt to pull any node from any CPU's list and remove it from the hashmap
> +
> +Even if an LRU node may be acquired, maps of type ``BPF_MAP_TYPE_LRU_HASH``
> +may fail to insert the entry into the map if other CPUs are heavily contending
> +on the global hashmap lock.
> +
> +This algorithm is described visually in the following diagram. See the
> +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of
> +the corresponding operations:
> +
> +.. kernel-figure::  map_lru_hash_update.dot
> +   :alt:    Diagram outlining the LRU eviction steps taken during map update
> +
> +   LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and
> +   variants
> +
> +Map updates start from the oval in the top right "begin ``bpf_map_update()``"
> +and progress through the graph towards the bottom where the result may be
> +either a successful update or a failure with various error codes. The key in
> +the top right provides indicators for which locks may be involved in specific
> +operations. This is intended as a visual hint for reasoning about how map
> +contention may impact update operations, though the map type and flags may
> +impact the actual contention on those locks, based on the logic described in
> +the table above. For instance, if the map is created with type
> +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_NO_COMMON_LRU`` then all map
> +properties would be per-cpu.
> +
> +The dot file source for the above figure uses internal kernel function names
> +for the node names in order to make the corresponding logic easier to find.
> diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot
> new file mode 100644
> index 000000000000..3a44ebec501e
> --- /dev/null
> +++ b/Documentation/bpf/map_lru_hash_update.dot
> @@ -0,0 +1,166 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright (C) 2022-2023 Isovalent, Inc.
> +digraph {
> +  node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes
> +  graph [splines=ortho, nodesep=1]
> +
> +  subgraph cluster_key {
> +    label = "Key\n(locks held during operation)";
> +    rankdir = TB;
> +
> +    remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"]
> +    hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"]
> +    lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"]
> +    local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"]
> +    no_lock [shape=rectangle,label="no locks held"]
> +  }
> +
> +  begin [shape=oval,label="begin\nbpf_map_update()"]
> +
> +  // Nodes below with an 'fn_' prefix are roughly labeled by the C function
> +  // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c.
> +  // Number suffixes and errno suffixes handle subsections of the corresponding
> +  // logic in the function as of the writing of this dot.
> +
> +  // The following corresponds to bpf_lru_pop_free() for common LRU case.

The comment that points to the function and explains that it's
applicable only to one case is a great idea, it's much clearer compared
to the previous version.

I believe there are some inaccuracies, though. As far as I see it,
local_freelist_check corresponds to __local_list_pop_free in the common
LRU case, specifically, to checking its return value; use_local_node
corresponds to returning that value; and common_lru_check corresponds
to bpf_lru_pop_free (for both common and percpu LRU, that's where the
distinction is made).

> +  local_freelist_check [shape=diamond,fillcolor=1,
> +    label="Local freelist\nnode available?"];
> +  // The following corresponds to __local_list_pop_free() for common LRU case.
> +  use_local_node [shape=rectangle,
> +    label="Use node owned\nby this CPU"]
> +
> +  common_lru_check [shape=diamond,
> +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];

Nit: the exact flag name is BPF_F_NO_COMMON_LRU.

> +
> +  fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2,
> +    label="Flush local pending,
> +    Rotate Global list, move
> +    LOCAL_FREE_TARGET
> +    from global -> local"]
> +  // Also corresponds to:
> +  // fn__local_list_flush()
> +  // fn_bpf_lru_list_rotate()
> +  fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2,
> +    label="Able to free\nLOCAL_FREE_TARGET\nnodes?"]
> +
> +  fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3,
> +    label="Shrink inactive list
> +      up to remaining
> +      LOCAL_FREE_TARGET
> +      (global LRU -> local)"]
> +  fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2,
> +    label="> 0 entries in\nlocal free list?"]
> +  fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2,
> +    label="Steal one node from
> +      inactive, or if empty,
> +      from active global list"]
> +  fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3,
> +    label="Try to remove\nnode from hashtab"]
> +
> +  local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"]
> +  common_lru_check2 [shape=diamond,
> +    label="Map created with\ncommon LRU?\n(!BPF_NO_COMMON_LRU)"];
> +
> +  subgraph cluster_remote_lock {
> +    label = "Iterate through CPUs\n(start from current)";
> +    style = dashed;
> +    rankdir=LR;
> +
> +    local_freelist_check5 [shape=diamond,fillcolor=4,
> +      label="Steal a node from\nper-cpu freelist?"]
> +    local_freelist_check6 [shape=rectangle,fillcolor=4,
> +      label="Steal a node from
> +        (1) Unreferenced pending, or
> +        (2) Any pending node"]
> +    local_freelist_check7 [shape=rectangle,fillcolor=3,
> +      label="Try to remove\nnode from hashtab"]
> +    fn_htab_lru_map_update_elem [shape=diamond,
> +      label="Stole node\nfrom remote\nCPU?"]
> +    fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"]
> +    // Also corresponds to:
> +    // use_local_node()
> +    // fn__local_list_pop_pending()
> +  }
> +
> +  fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle,
> +    label="Use node that was\nnot recently referenced"]
> +  local_freelist_check4 [shape=rectangle,
> +    label="Use node that was\nactively referenced\nin global list"]
> +  fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"]
> +  fn_htab_lru_map_update_elem3 [shape=rectangle,
> +    label="Use node that was\nactively referenced\nin (another?) CPU's cache"]
> +  fn_htab_lru_map_update_elem4 [shape=diamond,
> +    label="Can lock this\nhashtab bucket?"]
> +  fn_htab_lru_map_update_elem5 [shape=rectangle,fillcolor=3,
> +    label="Update hashmap\nwith new element"]
> +  fn_htab_lru_map_update_elem6 [shape=oval,label="return 0"]
> +  fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"]
> +
> +  begin -> local_freelist_check
> +  local_freelist_check -> use_local_node [xlabel="Y"]
> +  local_freelist_check -> common_lru_check [xlabel="N"]
> +  common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"]
> +  common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> +  fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free
> +  fn___bpf_lru_node_move_to_free ->
> +    fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"]
> +  fn___bpf_lru_node_move_to_free ->
> +    fn___bpf_lru_list_shrink_inactive [xlabel="N"]
> +  fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink
> +  fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"]
> +  fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"]
> +  fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3
> +  fn___bpf_lru_list_shrink3 -> local_freelist_check2
> +  local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"]
> +  local_freelist_check2 -> common_lru_check2 [xlabel = "N"]
> +  common_lru_check2 -> local_freelist_check5 [xlabel = "Y"]
> +  common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"]
> +  local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"]
> +  local_freelist_check5 -> local_freelist_check6 [xlabel = "N"]
> +  local_freelist_check6 -> local_freelist_check7
> +  local_freelist_check7 -> fn_htab_lru_map_update_elem
> +
> +  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"]
> +  fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2  [xlabel = "N"]
> +  fn_htab_lru_map_update_elem2 ->
> +    fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"]
> +  fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"]
> +  fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4
> +
> +  use_local_node -> fn_htab_lru_map_update_elem4
> +  fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4
> +  local_freelist_check4 -> fn_htab_lru_map_update_elem4
> +
> +  fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [xlabel="Y"]
> +  fn_htab_lru_map_update_elem4 ->
> +    fn_htab_lru_map_update_elem_EBUSY [xlabel="N"]
> +  fn_htab_lru_map_update_elem5 -> fn_htab_lru_map_update_elem6
> +
> +  // Create invisible pad nodes to line up various nodes
> +  pad0 [style=invis]
> +  pad1 [style=invis]
> +  pad2 [style=invis]
> +  pad3 [style=invis]
> +  pad4 [style=invis]
> +
> +  // Line up the key with the top of the graph
> +  no_lock -> local_lock [style=invis]
> +  local_lock -> lru_lock [style=invis]
> +  lru_lock -> hash_lock [style=invis]
> +  hash_lock -> remote_lock [style=invis]
> +  remote_lock -> local_freelist_check5 [style=invis]
> +  remote_lock -> fn___bpf_lru_list_shrink [style=invis]
> +
> +  // Line up return code nodes at the bottom of the graph
> +  fn_htab_lru_map_update_elem -> pad0 [style=invis]
> +  pad0 -> pad1 [style=invis]
> +  pad1 -> pad2 [style=invis]
> +  pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis]
> +  fn_htab_lru_map_update_elem4 -> pad3 [style=invis]
> +  pad3 -> fn_htab_lru_map_update_elem_EBUSY  [style=invis]
> +
> +  // Reduce diagram width by forcing some nodes to appear above others
> +  local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis]
> +  common_lru_check2 -> pad4 [style=invis]
> +  pad4 -> local_freelist_check5 [style=invis]
> +}
> -- 
> 2.34.1
> 

Thanks again for this patch, this piece of documentation really helped
me understand internals of the LRU hashmap.



[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux