Re: [PATCH bpf v3 4/6] bpf: Optimize the free of inner map

On Fri, Nov 24, 2023 at 3:29 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>
> From: Hou Tao <houtao1@xxxxxxxxxx>
>
> When removing the inner map from the outer map, the inner map will be
> freed after one RCU grace period and one RCU tasks trace grace
> period, so it is certain that any bpf program which may access the
> inner map has exited before the inner map is freed.
>
> However, it is unnecessary to wait for both grace periods in every
> case: no grace period is needed when the outer map is only accessed
> by user space, only one RCU grace period is needed when it is only
> accessed by non-sleepable programs, and only one RCU tasks trace
> grace period is needed when it is only accessed by sleepable
> programs. So record the sleepable attributes of the owning bpf
> programs when adding the outer map into env->used_maps, copy the
> recorded attributes to the inner map atomically when removing the
> inner map from the outer map, and use the recorded attributes in the
> inner map to decide which, and how many, RCU grace periods are
> needed when freeing the inner map.
>
> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
> ---
>  include/linux/bpf.h     |  8 +++++++-
>  kernel/bpf/map_in_map.c | 19 ++++++++++++++-----
>  kernel/bpf/syscall.c    | 15 +++++++++++++--
>  kernel/bpf/verifier.c   |  4 ++++
>  4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 15a6bb951b70..c5b549f352d7 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -245,6 +245,11 @@ struct bpf_list_node_kern {
>         void *owner;
>  } __attribute__((aligned(8)));
>
> +enum {
> +       BPF_MAP_RCU_GP = BIT(0),
> +       BPF_MAP_RCU_TT_GP = BIT(1),
> +};
> +
>  struct bpf_map {
>         /* The first two cachelines with read-mostly members of which some
>          * are also accessed in fast-path (e.g. ops, max_entries).
> @@ -296,7 +301,8 @@ struct bpf_map {
>         } owner;
>         bool bypass_spec_v1;
>         bool frozen; /* write-once; write-protected by freeze_mutex */
> -       bool free_after_mult_rcu_gp;
> +       atomic_t used_in_rcu_gp;
> +       atomic_t free_by_rcu_gp;
>         s64 __percpu *elem_count;
>  };
>
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index cf3363065566..d044ee677107 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -131,12 +131,21 @@ void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool deferred)
>  {
>         struct bpf_map *inner_map = ptr;
>
> -       /* The inner map may still be used by both non-sleepable and sleepable
> -        * bpf program, so free it after one RCU grace period and one tasks
> -        * trace RCU grace period.
> +       /* Defer the freeing of the inner map according to the attributes
> +        * of the bpf programs which own the outer map, so unnecessary
> +        * waits for multiple RCU GPs can be avoided.
>          */
> -       if (deferred)
> -               WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
> +       if (deferred) {
> +       if (deferred) {
> +               /* used_in_rcu_gp may be updated concurrently by a new bpf
> +                * program, so add smp_mb() to order the read of
> +                * used_in_rcu_gp against the lookup/deletion operations on
> +                * the inner map. If a new bpf program finds the inner map
> +                * before it is removed from the outer map, the read of
> +                * used_in_rcu_gp below will observe the bit set by that
> +                * program.
> +               smp_mb();
> +               atomic_or(atomic_read(&map->used_in_rcu_gp), &inner_map->free_by_rcu_gp);

You resent the patches before I had time to reply to the previous thread...

> I think the main reason is that there are four possible cases for the
> free of the inner map:
> (1) neither call synchronize_rcu() nor synchronize_rcu_tasks_trace():
> the outer map is only being accessed in user space
> (2) only call synchronize_rcu():
> the outer map is only being accessed by non-sleepable bpf programs
> (3) only call synchronize_rcu_tasks_trace():
> the outer map is only being accessed by sleepable bpf programs
> (4) call both synchronize_rcu() and synchronize_rcu_tasks_trace():
> the outer map is being accessed by both sleepable and non-sleepable
> bpf programs
>
> Using only sleepable_refcnt cannot express the 4 possible cases, and we
> also need to copy the state from the outer map to the inner map
> atomically, because one inner map may be used concurrently by multiple
> outer maps, so an atomic OR of a bitmask was chosen.
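
For reference, here is how the four cases above could drive the free
path. This is a sketch reconstructed from the patch description, not
code quoted from the series: bpf_map_free_by_gp() and both callbacks
are illustrative names, and it assumes struct bpf_map keeps its
existing work/rcu union and the bpf_map_free_deferred() work handler.

	/* Sketch only: consume the bits copied into free_by_rcu_gp to
	 * decide which grace periods must elapse before the free.
	 */
	static void bpf_map_free_rcu_cb(struct rcu_head *rcu)
	{
		struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

		/* every required grace period has elapsed */
		bpf_map_free_deferred(&map->work);
	}

	static void bpf_map_free_tt_rcu_cb(struct rcu_head *rcu)
	{
		struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

		/* case 4: one normal RCU GP must still follow the
		 * tasks trace GP; reuse the rcu head to chain it
		 */
		if (atomic_read(&map->free_by_rcu_gp) & BPF_MAP_RCU_GP)
			call_rcu(rcu, bpf_map_free_rcu_cb);
		else
			bpf_map_free_deferred(&map->work);
	}

	static void bpf_map_free_by_gp(struct bpf_map *map)
	{
		int gps = atomic_read(&map->free_by_rcu_gp);

		if (gps & BPF_MAP_RCU_TT_GP)
			/* cases 3 and 4 */
			call_rcu_tasks_trace(&map->rcu, bpf_map_free_tt_rcu_cb);
		else if (gps & BPF_MAP_RCU_GP)
			/* case 2 */
			call_rcu(&map->rcu, bpf_map_free_rcu_cb);
		else
			/* case 1: only user space saw the outer map */
			bpf_map_free_deferred(&map->work);
	}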

We don't care about optimizing 1, since it's a rare case.
We also don't care about optimizing 3, since the sync_rcu time is
negligible when we need to wait for sync_rcu_tasks_trace anyway, and
also because rcu_trace_implies_rcu_gp() holds for the foreseeable
future.
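
As a concrete illustration of that last point:
rcu_trace_implies_rcu_gp() reports whether a tasks trace grace period
also implies a normal RCU grace period, so a multi-GP free path can
skip the chained call_rcu() when it returns true. A sketch, reusing
the illustrative bpf_map_free_rcu_cb() from above:

	static void bpf_map_free_mult_gp_cb(struct rcu_head *rcu)
	{
		/* runs after a tasks trace GP; if that GP already
		 * implies a normal RCU GP, free right away instead
		 * of waiting for one more grace period
		 */
		if (rcu_trace_implies_rcu_gp())
			bpf_map_free_rcu_cb(rcu);
		else
			call_rcu(rcu, bpf_map_free_rcu_cb);
	}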

> need to atomically

we do NOT have such a need.
There is zero need to play these atomic games and add barriers "just
to be cautious". The code should not contain anything at all "to be
cautious".
Every barrier has to have a real reason behind it.
Please remove them.
The concurrent accesses to refcnt and sleepable_refcnt can be
serialized with a simple spin_lock, as in the sketch below.
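
A minimal sketch of that suggestion, with illustrative field and
function names (the real series may hang these counters elsewhere):
the verifier would bump the counters under the lock when a program
starts using the map, and the put path would read them under the same
lock, so no barriers are needed.

	struct bpf_map_user_cnt {
		spinlock_t lock;
		u32 refcnt;		/* bpf programs using the map */
		u32 sleepable_refcnt;	/* subset of refcnt that is sleepable */
	};

	static void map_user_cnt_inc(struct bpf_map_user_cnt *c, bool sleepable)
	{
		spin_lock(&c->lock);
		c->refcnt++;
		if (sleepable)
			c->sleepable_refcnt++;
		spin_unlock(&c->lock);
	}

	static bool map_has_sleepable_user(struct bpf_map_user_cnt *c)
	{
		bool ret;

		/* the lock serializes readers against concurrent
		 * inc/dec, so plain loads and stores suffice
		 */
		spin_lock(&c->lock);
		ret = c->sleepable_refcnt > 0;
		spin_unlock(&c->lock);
		return ret;
	}

The deferred-free path then only has to test map_has_sleepable_user()
to choose between call_rcu() and call_rcu_tasks_trace(), with no
smp_mb() or atomic_or() involved.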

I also don't like
> +       BPF_MAP_RCU_GP = BIT(0),
> +       BPF_MAP_RCU_TT_GP = BIT(1),

The names should not describe what action should be taken; they
should describe what the bits mean.
I still think sleepable_refcnt and refcnt are more than enough to
optimize patch 3.
