On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
We have replaced kvalue-refcnt with synchronize_rcu() to wait for an
RCU grace period.
Maintenance of kvalue->refcnt was a complicated task, as we had to
simultaneously keep track of two reference counts: one for the
reference count of bpf_map. When the kvalue->refcnt reaches zero, we
also have to reduce the reference count on bpf_map - yet these steps
are not performed in an atomic manner and require us to be vigilant
when managing them. By eliminating kvalue->refcnt, we can make our
maintenance more straightforward as the refcount of bpf_map is now
solely managed!
To prevent the trampoline image of a struct_ops from being released
while it is still in use, we wait for an RCU grace period. The
setsockopt(TCP_CONGESTION, "...") command allows you to change your
socket's congestion control algorithm and can result in releasing the
old struct_ops implementation.
If the setsockopt() above is referring to the syscall setsockopt(), then the old
struct_ops is fine. The old struct_ops is protected by the struct_ops map's
refcnt (or the current kvalue->refcnt). The sk in setsockopt(sk, ...) will no
longer use the old struct_ops before the refcnt is decremented. This part should
be the same as the tcp-cc kernel module.
Moreover, since this function is
exposed through bpf_setsockopt(), it may be accessed by BPF programs
as well. To ensure that the trampoline image belonging to struct_op
can be safely called while its method is in use, struct_ops is
safeguarded with rcu_read_lock(). Doing so prevents any destruction of
the associated images before returning from a trampoline and requires
us to wait for an RCU grace period.
The bpf_setsockopt(TCP_CONGESTION) is the reason that the trampoline image needs
a grace period, but I noticed RCU grace period itself is not enough for
trampoline image and more on this later.
Another reason the struct_ops map needs a RCU grace period is because of the
bpf_try_module_get() (in tcp_set_default_congestion_control for example).
---
include/linux/bpf.h | 1 +
kernel/bpf/bpf_struct_ops.c | 68 ++++++++++++++++++++-----------------
kernel/bpf/syscall.c | 6 ++--
3 files changed, 42 insertions(+), 33 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e64ff1e89fb2..00ca92ea6f2e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1938,6 +1938,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
struct bpf_map *__bpf_map_get(struct fd f);
void bpf_map_inc(struct bpf_map *map);
void bpf_map_inc_with_uref(struct bpf_map *map);
+struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
void bpf_map_put_with_uref(struct bpf_map *map);
void bpf_map_put(struct bpf_map *map);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 38903fb52f98..ab7811a4c1dd 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -58,6 +58,11 @@ struct bpf_struct_ops_map {
struct bpf_struct_ops_value kvalue;
};
+struct bpf_struct_ops_link {
+ struct bpf_link link;
+ struct bpf_map __rcu *map;
+};
Comparing with v5, this is moved from patch 3 to patch 1. It is not used here,
so it belongs to patch 3.
@@ -574,6 +585,19 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
{
struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+ /* The struct_ops's function may switch to another struct_ops.
+ *
+ * For example, bpf_tcp_cc_x->init() may switch to
+ * another tcp_cc_y by calling
+ * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+ * During the switch, bpf_struct_ops_put(tcp_cc_x) is called
+ * and its refcount may reach 0 which then free its
+ * trampoline image while tcp_cc_x is still running.
+ *
+ * Thus, a rcu grace period is needed here.
+ */
+ synchronize_rcu();
After the trampoline image finished running a struct_ops's "prog", it still has
a few insn need to execute in the trampoline image, so it also needs to wait for
synchronize_rcu_tasks/call_rcu_tasks.
This is an old issue, only happens when the struct_ops prog calls
bpf_setsockopt(TCP_CONGESTION) with CONFIG_PREEMPT and unlikely other upcoming
struct_ops subsystem may need this, please help to do a follow up fix on it
(separate from this set) to also wait for the rcu_tasks gp.