On 11/13/23 4:33 AM, Hou Tao wrote:
From: Hou Tao <houtao1@xxxxxxxxxx>
When removing the inner map from the outer map, the inner map will be
freed after one RCU grace period and one RCU tasks trace grace period,
so it is certain that any bpf program which may access the inner map
has exited before the inner map is freed.
However, it is unnecessary to wait for both one RCU grace period and one
RCU tasks trace grace period if the outer map is only accessed by
sleepable programs or only by non-sleepable programs. So record the
context of the owning bpf programs when adding a map into
env->used_maps, and use the recorded access context to decide which, and
how many, RCU grace periods are needed when freeing the inner map.
Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
---
include/linux/bpf.h | 9 ++++++++-
kernel/bpf/map_in_map.c | 18 +++++++++++++-----
kernel/bpf/syscall.c | 15 +++++++++++++--
kernel/bpf/verifier.c | 5 +++++
4 files changed, 39 insertions(+), 8 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ec3c90202ffe6..8faa1af4b39df 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -245,6 +245,12 @@ struct bpf_list_node_kern {
void *owner;
} __attribute__((aligned(8)));
+enum {
+ BPF_MAP_ACC_NORMAL_PROG_CTX = 1,
+ BPF_MAP_ACC_SLEEPABLE_PROG_CTX = 2,
nit. It is a bit flag. Use (1U << 0) and (1U << 1) to make it more obvious.
How about renaming these to BPF_MAP_RCU_GP and BPF_MAP_RCU_TT_GP to better
reflect that they are used to decide whether map_free needs to wait for any rcu gp?
+ BPF_MAP_ACC_PROG_CTX_MASK = BPF_MAP_ACC_NORMAL_PROG_CTX | BPF_MAP_ACC_SLEEPABLE_PROG_CTX,
+};
+
struct bpf_map {
/* The first two cachelines with read-mostly members of which some
* are also accessed in fast-path (e.g. ops, max_entries).
@@ -292,7 +298,8 @@ struct bpf_map {
} owner;
bool bypass_spec_v1;
bool frozen; /* write-once; write-protected by freeze_mutex */
- bool free_after_mult_rcu_gp;
+ atomic_t owned_prog_ctx;
Instead of the enum flags, this should only need a true/false value to tell
whether the outer map has ever been used by a sleepable prog or not. Maybe
rename this to "atomic used_by_sleepable;"?
+ atomic_t may_be_accessed_prog_ctx;
nit. rename this to "rcu_gp_flags;"
s64 __percpu *elem_count;
};
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index cf33630655661..e3d26a89ac5b6 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -131,12 +131,20 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred)
{
struct bpf_map *inner_map = ptr;
- /* The inner map may still be used by both non-sleepable and sleepable
- * bpf program, so free it after one RCU grace period and one tasks
- * trace RCU grace period.
+ /* Defer the freeing of inner map according to the owner program
+ * context of outer maps, so unnecessary multiple RCU GP waitings
+ * can be avoided.
*/
- if (deferred)
- WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
+ if (deferred) {
+ /* owned_prog_ctx may be updated concurrently by new bpf program
+ * so add smp_mb() below to ensure that reading owned_prog_ctx
+ * will return the newly-set bit when the new bpf program finds
+ * the inner map before it is removed from outer map.
+ */
+ smp_mb();
This part got my head spinning a little, so it is better to ask. The
owned_prog_ctx is set during verification time. There are many instructions
until the prog is actually verified, attached (another syscall), and then run
to do the actual lookup(&outer_map). Is this level of reordering practically
possible?
+ atomic_or(atomic_read(&map->owned_prog_ctx),
+ &inner_map->may_be_accessed_prog_ctx);
+ }
bpf_map_put(inner_map);
}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e2d2701ce2c45..5a7906f2b027e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct work_struct *work)
{
struct bpf_map *map = container_of(work, struct bpf_map, work);
struct btf_record *rec = map->record;
+ int acc_ctx;
security_bpf_map_free(map);
bpf_map_release_memcg(map);
- if (READ_ONCE(map->free_after_mult_rcu_gp))
- synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
+ acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) & BPF_MAP_ACC_PROG_CTX_MASK;
The mask should not be needed.
+ if (acc_ctx) {
+ if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
+ synchronize_rcu();
+ else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
+ synchronize_rcu_tasks_trace();
+ else
+ synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
Is it better to add an rcu_head to the map and then use call_rcu_*()? E.g. when
there are many deletes happening to the outer map during a process restart to
re-populate the outer map, it is relatively much cheaper to add one rcu_head to
the map compared to adding one for each elem. wdyt?
+ }
/* implementation dependent freeing */
map->ops->map_free(map);
@@ -5326,6 +5334,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
goto out_unlock;
}
+ /* No need to update owned_prog_ctx, because the bpf program doesn't
+ * access the map.
+ */
memcpy(used_maps_new, used_maps_old,
sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
used_maps_new[prog->aux->used_map_cnt] = map;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bd1c42eb540f1..d8d5432b240dc 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18012,12 +18012,17 @@ static int resolve_pseudo_ldimm64(struct bpf_verifier_env *env)
return -E2BIG;
}
+ atomic_or(env->prog->aux->sleepable ? BPF_MAP_ACC_SLEEPABLE_PROG_CTX :
+ BPF_MAP_ACC_NORMAL_PROG_CTX,
+ &map->owned_prog_ctx);
/* hold the map. If the program is rejected by verifier,
* the map will be released by release_maps() or it
* will be used by the valid program until it's unloaded
* and all maps are released in free_used_maps()
*/
bpf_map_inc(map);
+ /* Paired with smp_mb() in bpf_map_fd_put_ptr() */
+ smp_mb__after_atomic();
aux->map_index = env->used_map_cnt;
env->used_maps[env->used_map_cnt++] = map;