Re: [PATCH bpf v2 4/5] bpf: Optimize the free of inner map

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 11/19/2023 7:28 AM, Martin KaFai Lau wrote:
> On 11/18/23 5:04 AM, Hou Tao wrote:
>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>> index ec3c90202ffe6..8faa1af4b39df 100644
>>>> --- a/include/linux/bpf.h
>>>> +++ b/include/linux/bpf.h
>>>> @@ -245,6 +245,12 @@ struct bpf_list_node_kern {
>>>>        void *owner;
>>>>    } __attribute__((aligned(8)));
>>>>    +enum {
>>>> +    BPF_MAP_ACC_NORMAL_PROG_CTX = 1,
>>>> +    BPF_MAP_ACC_SLEEPABLE_PROG_CTX = 2,
>>>
>>> nit. It is a bit flag. Use (1U << 0) and (1U << 1) to make it more
>>> obvious.
>>
>> Will fix.
>>>
>>> How about renaming this to BPF_MAP_RCU_GP and BPF_MAP_RCU_TT_GP to
>>> better reflect it is used to decide if the map_free needs to wait for
>>> any rcu gp?
>>
>> OK. It is fine with me.
>>>
>>>> +    BPF_MAP_ACC_PROG_CTX_MASK = BPF_MAP_ACC_NORMAL_PROG_CTX |
>>>> BPF_MAP_ACC_SLEEPABLE_PROG_CTX,
>>>> +};
>>>> +
>>>>    struct bpf_map {
>>>>        /* The first two cachelines with read-mostly members of
>>>> which some
>>>>         * are also accessed in fast-path (e.g. ops, max_entries).
>>>> @@ -292,7 +298,8 @@ struct bpf_map {
>>>>        } owner;
>>>>        bool bypass_spec_v1;
>>>>        bool frozen; /* write-once; write-protected by freeze_mutex */
>>>> -    bool free_after_mult_rcu_gp;
>>>> +    atomic_t owned_prog_ctx;
>>>
>>> Instead of the enum flags, this should only need a true/false value to
>>> tell whether the outer map has ever been used by a sleepable prog or
>>> not. may be renaming this to "atomic used_by_sleepable;"?
>>
>> I have considered about that. But there is maps which is only accessed
>> in userspace (e.g. map used in BPF_PROG_BIND_MAP or maps in test_maps),
>> so I though one bool is not enough.
>
> Good point.
>
> In this case, a nit on the name. The "prog_ctx" part in the
> owned_prog_ctx usually means a different thing, like the "struct
> __sk_buff". How about using this pair of names?
>
>      /* may be just "long" and then set_bit? */
>     atomic_t used_in_rcu_gp;
>     atomic_t free_by_rcu_gp;
>
> shortened the name by removing the "_flags" suggested in the earlier
> comment.

Will update the names. The naming is hard, so thanks for the suggestions.

I have tried set_bit() and test_bit() . The reason I switched to
atomic_or() is that the implementation will be much cleaner in
bpf_map_fd_put_ptr(). If using set_bit(), I had to do the following
things in bpf_map_fd_put_ptr() to atomically assign the bits in
used_in_rcu_gp to free_by_rcu_gp:

used = READ_ONCE(used_in_rcu_gp);
if (test_bit(BPF_MAP_RCU_GP, &used))
    set_bit(BPF_MAP_RCU_GP, &free_by_rcu_gp);
if (test_bit(BPF_MAP_RCU_TT_GP , &used)
    set_bit(BPF_MAP_RCU_TT_GP, &free_by_rcu_gp);

But for atomic_or(), only two statements are needed:

atomic_or(atomic_read(&used_in_rcu_gp), &free_by_rcu_gp)

I could use set_bit/test_bit() if it is preferred.

>
>>>
>>>> +    atomic_t may_be_accessed_prog_ctx;
>>>
>>> nit. rename this to "rcu_gp_flags;"
>>
>> Will do.
>>>
>>>>        s64 __percpu *elem_count;
>>>>    };
>>>>    diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
>>>> index cf33630655661..e3d26a89ac5b6 100644
>>>> --- a/kernel/bpf/map_in_map.c
>>>> +++ b/kernel/bpf/map_in_map.c
>>>> @@ -131,12 +131,20 @@ void bpf_map_fd_put_ptr(struct bpf_map *map,
>>>> void *ptr, bool deferred)
>>>>    {
>>>>        struct bpf_map *inner_map = ptr;
>>>>    -    /* The inner map may still be used by both non-sleepable and
>>>> sleepable
>>>> -     * bpf program, so free it after one RCU grace period and one
>>>> tasks
>>>> -     * trace RCU grace period.
>>>> +    /* Defer the freeing of inner map according to the owner program
>>>> +     * context of outer maps, so unnecessary multiple RCU GP waitings
>>>> +     * can be avoided.
>>>>         */
>>>> -    if (deferred)
>>>> -        WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
>>>> +    if (deferred) {
>>>> +        /* owned_prog_ctx may be updated concurrently by new bpf
>>>> program
>>>> +         * so add smp_mb() below to ensure that reading
>>>> owned_prog_ctx
>>>> +         * will return the newly-set bit when the new bpf program
>>>> finds
>>>> +         * the inner map before it is removed from outer map.
>>>> +         */
>>>> +        smp_mb();
>>>
>>> This part took my head spinning a little, so it is better to ask. The
>>> owned_prog_ctx is set during verification time. There are many
>>> instructions till the prog is actually verified, attached (another
>>> syscall) and then run to do the actual lookup(&outer_map). Is this
>>> level of reordering practically possible?
>>
>> Er, I added the memory barrier due to uncertainty. According to [1],
> My immediate thought was a more straight forward spin lock instead.
>
> Agree that smp_mb() works as well, ok.

I see. I will keep the smp_mb().
>
>> syscall doesn't imply a memory barrier. And If I understand correctly,
>> when the program is loaded and attached through bpf_sys_bpf, there will
>> be no new syscall. >
>> [1]:
>> https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-barriers.html
>>
>>
>>>
>>>> +        atomic_or(atomic_read(&map->owned_prog_ctx),
>>>> +              &inner_map->may_be_accessed_prog_ctx);
>>>> +    }
>>>>        bpf_map_put(inner_map);
>>>>    }
>>>>    diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index e2d2701ce2c45..5a7906f2b027e 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -694,12 +694,20 @@ static void bpf_map_free_deferred(struct
>>>> work_struct *work)
>>>>    {
>>>>        struct bpf_map *map = container_of(work, struct bpf_map, work);
>>>>        struct btf_record *rec = map->record;
>>>> +    int acc_ctx;
>>>>          security_bpf_map_free(map);
>>>>        bpf_map_release_memcg(map);
>>>>    -    if (READ_ONCE(map->free_after_mult_rcu_gp))
>>>> -        synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>>>> +    acc_ctx = atomic_read(&map->may_be_accessed_prog_ctx) &
>>>> BPF_MAP_ACC_PROG_CTX_MASK;
>>>
>>> The mask should not be needed.
>>
>> Yep. Will remove it.
>>>
>>>> +    if (acc_ctx) {
>>>> +        if (acc_ctx == BPF_MAP_ACC_NORMAL_PROG_CTX)
>>>> +            synchronize_rcu();
>>>> +        else if (acc_ctx == BPF_MAP_ACC_SLEEPABLE_PROG_CTX)
>>>> +            synchronize_rcu_tasks_trace();
>>>> +        else
>>>> +            synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace);
>>>
>>> Is it better to add a rcu_head to the map and then use call_rcu_().
>>> e.g. when there is many delete happened to the outer map during a
>>> process restart to re-populate the outer map. It is relatively much
>>> cheaper to add a rcu_head to the map comparing to adding one for each
>>> elem. wdyt?
>>
>> Good idea. call_rcu() will be much cheaper than synchronize_rcu().
>> Will do.
>>>
>>>> +    }
>>>>          /* implementation dependent freeing */
>>>>        map->ops->map_free(map);
>>>> @@ -5326,6 +5334,9 @@ static int bpf_prog_bind_map(union bpf_attr
>>>> *attr)
>>>>            goto out_unlock;
>>>>        }
>>>>    +    /* No need to update owned_prog_ctx, because the bpf program
>>>> doesn't
>>>> +     * access the map.
>>>> +     */
>>>>        memcpy(used_maps_new, used_maps_old,
>>>>               sizeof(used_maps_old[0]) * prog->aux->used_map_cnt);
>>>>        used_maps_new[prog->aux->used_map_cnt] = map;
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index bd1c42eb540f1..d8d5432b240dc 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -18012,12 +18012,17 @@ static int resolve_pseudo_ldimm64(struct
>>>> bpf_verifier_env *env)
>>>>                    return -E2BIG;
>>>>                }
>>>>    +            atomic_or(env->prog->aux->sleepable ?
>>>> BPF_MAP_ACC_SLEEPABLE_PROG_CTX :
>>>> +                                  BPF_MAP_ACC_NORMAL_PROG_CTX,
>>>> +                  &map->owned_prog_ctx);
>
> This is setting a bit. Would set_bit() work as good?

Please see my reply above.

>
>>>>                /* hold the map. If the program is rejected by
>>>> verifier,
>>>>                 * the map will be released by release_maps() or it
>>>>                 * will be used by the valid program until it's
>>>> unloaded
>>>>                 * and all maps are released in free_used_maps()
>>>>                 */
>>>>                bpf_map_inc(map);
>>>> +            /* Paired with smp_mb() in bpf_map_fd_put_ptr() */
>>>> +            smp_mb__after_atomic();
>
> To make it clearer, can this be moved immediately after the set_bit /
> atomic_or above?

Will do. Thanks for all the suggestions.
>
>>>>                  aux->map_index = env->used_map_cnt;
>>>>                env->used_maps[env->used_map_cnt++] = map;
>>





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux