Hi,

On 11/18/2023 3:34 PM, Martin KaFai Lau wrote:
> On 11/13/23 4:33 AM, Hou Tao wrote:
>> From: Hou Tao <houtao1@xxxxxxxxxx>
>>
>> When removing the inner map from the outer map, the inner map will be
>> freed after one RCU grace period and one RCU tasks trace grace
>> period, so it is certain that any bpf program which may access the
>> inner map has exited before the inner map is freed.
>>
>> However, it is unnecessary to wait for both one RCU grace period and
>> one RCU tasks trace grace period if the outer map is only accessed by
>> sleepable programs or only by non-sleepable programs. So record the
>> context of the owning bpf programs when adding the map into
>> env->used_maps, and use the recorded access context to decide which,
>> and how many, RCU grace periods are needed when freeing the inner map.
>>
>> Signed-off-by: Hou Tao <houtao1@xxxxxxxxxx>
>> ---
>>  include/linux/bpf.h     |  9 ++++++++-
>>  kernel/bpf/map_in_map.c | 18 +++++++++++++-----
>>  kernel/bpf/syscall.c    | 15 +++++++++++++--
>>  kernel/bpf/verifier.c   |  5 +++++
>>  4 files changed, 39 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index ec3c90202ffe6..8faa1af4b39df 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -245,6 +245,12 @@ struct bpf_list_node_kern {
>>      void *owner;
>>  } __attribute__((aligned(8)));
>> +enum {
>> +    BPF_MAP_ACC_NORMAL_PROG_CTX = 1,
>> +    BPF_MAP_ACC_SLEEPABLE_PROG_CTX = 2,
>
> nit. It is a bit flag. Use (1U << 0) and (1U << 1) to make it more
> obvious.

Will fix.

>
> How about renaming this to BPF_MAP_RCU_GP and BPF_MAP_RCU_TT_GP to
> better reflect it is used to decide if the map_free needs to wait for
> any rcu gp?

OK. It is fine with me.
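Just to confirm, with both suggestions applied the definition would look
roughly like this (untested sketch):

    /* Flags to decide which RCU GPs map_free needs to wait for */
    enum {
        BPF_MAP_RCU_GP = (1U << 0),    /* used by non-sleepable prog */
        BPF_MAP_RCU_TT_GP = (1U << 1), /* used by sleepable prog */
    };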
>
>> +    BPF_MAP_ACC_PROG_CTX_MASK = BPF_MAP_ACC_NORMAL_PROG_CTX |
>> BPF_MAP_ACC_SLEEPABLE_PROG_CTX,
>> +};
>> +
>>  struct bpf_map {
>>      /* The first two cachelines with read-mostly members of which some
>>       * are also accessed in fast-path (e.g. ops, max_entries).
>> @@ -292,7 +298,8 @@ struct bpf_map {
>>      } owner;
>>      bool bypass_spec_v1;
>>      bool frozen; /* write-once; write-protected by freeze_mutex */
>> -    bool free_after_mult_rcu_gp;
>> +    atomic_t owned_prog_ctx;
>
> Instead of the enum flags, this should only need a true/false value to
> tell whether the outer map has ever been used by a sleepable prog or
> not. may be renaming this to "atomic used_by_sleepable;"?

I have considered that, but there are maps which are only accessed from
userspace (e.g. maps used in BPF_PROG_BIND_MAP or maps in test_maps),
so I thought one bool was not enough.

>
>> +    atomic_t may_be_accessed_prog_ctx;
>
> nit. rename this to "rcu_gp_flags;"

Will do.

>
>>      s64 __percpu *elem_count;
>>  };
>> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
>> index cf33630655661..e3d26a89ac5b6 100644
>> --- a/kernel/bpf/map_in_map.c
>> +++ b/kernel/bpf/map_in_map.c
>> @@ -131,12 +131,20 @@ void bpf_map_fd_put_ptr(struct bpf_map *map,
>> void *ptr, bool deferred)
>>  {
>>      struct bpf_map *inner_map = ptr;
>> -    /* The inner map may still be used by both non-sleepable and
>> sleepable
>> -     * bpf program, so free it after one RCU grace period and one tasks
>> -     * trace RCU grace period.
>> +    /* Defer the freeing of inner map according to the owner program
>> +     * context of outer maps, so unnecessary multiple RCU GP waitings
>> +     * can be avoided.
>>       */
>> -    if (deferred)
>> -        WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
>> +    if (deferred) {
>> +        /* owned_prog_ctx may be updated concurrently by new bpf
>> program
>> +         * so add smp_mb() below to ensure that reading owned_prog_ctx
>> +         * will return the newly-set bit when the new bpf program finds
>> +         * the inner map before it is removed from outer map.
>> +         */
>> +        smp_mb();
>
> This part took my head spinning a little, so it is better to ask. The
> owned_prog_ctx is set during verification time. There are many
> instructions till the prog is actually verified, attached (another
> syscall) and then run to do the actual lookup(&outer_map). Is this
> level of reordering practically possible?

Er, I added the memory barrier due to uncertainty. According to [1], a
syscall doesn't imply a memory barrier. And if I understand correctly,
when the program is loaded and attached through bpf_sys_bpf, there will
be no new syscall.

[1]: https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-barriers.html

>
>> +        atomic_or(atomic_read(&map->owned_prog_ctx),
>> +              &inner_map->may_be_accessed_prog_ctx);
>> +    }
>>      bpf_map_put(inner_map);
>>  }
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e2d2701ce2c45..5a7906f2b027e 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct
>> work_struct *work)
>>  {
>>      struct bpf_map *map = container_of(work, struct bpf_map, work);
>>      struct btf_record *rec = map->record;
>> +    int acc_ctx;
>>      security_bpf_map_free(map);
>>      bpf_map_release_memcg(map);
>> -    if (READ_ONCE(map->free_after_mult_rcu_gp))
>> -        synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>> +    acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) &
>> BPF_MAP_ACC_PROG_CTX_MASK;
>
> The mask should not be needed.

Yep. Will remove it.

>
>> +    if (acc_ctx) {
>> +        if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
>> +            synchronize_rcu();
>> +        else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
>> +            synchronize_rcu_tasks_trace();
>> +        else
>> +            synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>
> Is it better to add a rcu_head to the map and then use call_rcu_().
> e.g. when there is many delete happened to the outer map during a
> process restart to re-populate the outer map. It is relatively much
> cheaper to add a rcu_head to the map comparing to adding one for each
> elem. wdyt?

Good idea. call_rcu() will be much cheaper than synchronize_rcu().
Will do.
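Roughly, I am thinking of something like the sketch below. It assumes a
new "struct rcu_head rcu" field in struct bpf_map, and if ->map_free()
of some map types may sleep, the final callback would need to queue a
work item instead of freeing directly:

    static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
    {
        struct bpf_map *map = container_of(rcu, struct bpf_map, rcu);

        /* implementation dependent freeing */
        map->ops->map_free(map);
    }

    static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
    {
        /* a tasks trace RCU GP has elapsed; chain a normal RCU GP */
        call_rcu(rcu, bpf_map_free_rcu_gp);
    }

    /* in bpf_map_free_deferred(), using the renamed flags */
    if (acc_ctx == BPF_MAP_RCU_GP)
        call_rcu(&map->rcu, bpf_map_free_rcu_gp);
    else if (acc_ctx == BPF_MAP_RCU_TT_GP)
        call_rcu_tasks_trace(&map->rcu, bpf_map_free_rcu_gp);
    else
        call_rcu_tasks_trace(&map->rcu, bpf_map_free_mult_rcu_gp);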
>
>> +    }
>>      /* implementation dependent freeing */
>>      map->ops->map_free(map);
>> @@ -5326,6 +5334,9 @@ static int bpf_prog_bind_map(union bpf_attr *attr)
>>          goto out_unlock;
>>      }
>> +    /* No need to update owned_prog_ctx, because the bpf program
>> doesn't
>> +     * access the map.
>> +     */
>>      memcpy(used_maps_new, used_maps_old,
>>             sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
>>      used_maps_new[prog->aux->used_map_cnt] = map;
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index bd1c42eb540f1..d8d5432b240dc 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -18012,12 +18012,17 @@ static int resolve_pseudo_ldimm64(struct
>> bpf_verifier_env *env)
>>              return -E2BIG;
>>          }
>> +        atomic_or(env->prog->aux->sleepable ?
>> BPF_MAP_ACC_SLEEPABLE_PROG_CTX :
>> +              BPF_MAP_ACC_NORMAL_PROG_CTX,
>> +              &map->owned_prog_ctx);
>>          /* hold the map. If the program is rejected by verifier,
>>           * the map will be released by release_maps() or it
>>           * will be used by the valid program until it's unloaded
>>           * and all maps are released in free_used_maps()
>>           */
>>          bpf_map_inc(map);
>> +        /* Paired with smp_mb() in bpf_map_fd_put_ptr() */
>> +        smp_mb__after_atomic();
>>          aux->map_index = env->used_map_cnt;
>>          env->used_maps[env->used_map_cnt++] = map;
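To spell out the ordering I was trying to get, here is a rough two-CPU
sketch. It treats program load and run as one ordered sequence, which is
exactly the part I am uncertain about:

    prog load + run                        outer map update path
    ---------------                        ---------------------
    atomic_or(CTX, &map->owned_prog_ctx)
    smp_mb__after_atomic()
    lookup(&outer_map) finds inner map     xchg() removes inner map
                                           smp_mb()
                                           atomic_read(&map->owned_prog_ctx)

With full barriers on both sides, it is forbidden that the lookup still
finds the inner map while the atomic_read() misses the newly-set bit, so
bpf_map_fd_put_ptr() should propagate the right flags into the inner map.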