Re: [PATCH bpf-next v6 1/8] bpf: Retire the struct_ops map kvalue->refcnt.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
We have replaced kvalue-refcnt with synchronize_rcu() to wait for an
RCU grace period.

Maintenance of kvalue->refcnt was a complicated task, as we had to
simultaneously keep track of two reference counts: one for the
reference count of bpf_map. When the kvalue->refcnt reaches zero, we
also have to reduce the reference count on bpf_map - yet these steps
are not performed in an atomic manner and require us to be vigilant
when managing them. By eliminating kvalue->refcnt, we can make our
maintenance more straightforward as the refcount of bpf_map is now
solely managed!

To prevent the trampoline image of a struct_ops from being released
while it is still in use, we wait for an RCU grace period. The
setsockopt(TCP_CONGESTION, "...") command allows you to change your
socket's congestion control algorithm and can result in releasing the
old struct_ops implementation.

If the setsockopt() above is referring to the syscall setsockopt(), then the old struct_ops is fine. The old struct_ops is protected by the struct_ops map's refcnt (or the current kvalue->refcnt). The sk in setsockopt(sk, ...) will no longer use the old struct_ops before the refcnt is decremented. This part should be the same as the tcp-cc kernel module.

Moreover, since this function is
exposed through bpf_setsockopt(), it may be accessed by BPF programs
as well. To ensure that the trampoline image belonging to struct_op
can be safely called while its method is in use, struct_ops is
safeguarded with rcu_read_lock(). Doing so prevents any destruction of
the associated images before returning from a trampoline and requires
us to wait for an RCU grace period.

The bpf_setsockopt(TCP_CONGESTION) is the reason that the trampoline image needs a grace period, but I noticed RCU grace period itself is not enough for trampoline image and more on this later.

Another reason the struct_ops map needs a RCU grace period is because of the bpf_try_module_get() (in tcp_set_default_congestion_control for example).


---
  include/linux/bpf.h         |  1 +
  kernel/bpf/bpf_struct_ops.c | 68 ++++++++++++++++++++-----------------
  kernel/bpf/syscall.c        |  6 ++--
  3 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e64ff1e89fb2..00ca92ea6f2e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1938,6 +1938,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
  struct bpf_map *__bpf_map_get(struct fd f);
  void bpf_map_inc(struct bpf_map *map);
  void bpf_map_inc_with_uref(struct bpf_map *map);
+struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
  struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
  void bpf_map_put_with_uref(struct bpf_map *map);
  void bpf_map_put(struct bpf_map *map);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 38903fb52f98..ab7811a4c1dd 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -58,6 +58,11 @@ struct bpf_struct_ops_map {
  	struct bpf_struct_ops_value kvalue;
  };
+struct bpf_struct_ops_link {
+	struct bpf_link link;
+	struct bpf_map __rcu *map;
+};

Comparing with v5, this is moved from patch 3 to patch 1. It is not used here, so it belongs to patch 3.


@@ -574,6 +585,19 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
  {
  	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+ /* The struct_ops's function may switch to another struct_ops.
+	 *
+	 * For example, bpf_tcp_cc_x->init() may switch to
+	 * another tcp_cc_y by calling
+	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
+	 * and its refcount may reach 0 which then free its
+	 * trampoline image while tcp_cc_x is still running.
+	 *
+	 * Thus, a rcu grace period is needed here.
+	 */
+	synchronize_rcu();

After the trampoline image finished running a struct_ops's "prog", it still has a few insn need to execute in the trampoline image, so it also needs to wait for synchronize_rcu_tasks/call_rcu_tasks.

This is an old issue, only happens when the struct_ops prog calls bpf_setsockopt(TCP_CONGESTION) with CONFIG_PREEMPT and unlikely other upcoming struct_ops subsystem may need this, please help to do a follow up fix on it (separate from this set) to also wait for the rcu_tasks gp.





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux