Re: [RFC PATCH bpf-next 1/2] bpf: Introduce global percpu data

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 15/1/25 07:10, Andrii Nakryiko wrote:
> On Mon, Jan 13, 2025 at 7:25 AM Leon Hwang <leon.hwang@xxxxxxxxx> wrote:
>>
>> This patch introduces global per-CPU data, inspired by commit
>> 6316f78306c1 ("Merge branch 'support-global-data'"). It enables the
>> definition of global per-CPU variables in BPF, similar to the
>> DEFINE_PER_CPU() macro in the kernel[0].
>>
>> For example, in BPF, it is able to define a global per-CPU variable like
>> this:
>>
>> int percpu_data SEC(".data..percpu");
>>
>> With this patch, tools like retsnoop[1] and bpflbr[2] can simplify their
>> BPF code for handling LBRs. The code can be updated from
>>
>> static struct perf_branch_entry lbrs[1][MAX_LBR_ENTRIES] SEC(".data.lbrs");
>>
>> to
>>
>> static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES] SEC(".data..percpu.lbrs");
>>
>> This eliminates the need to retrieve the CPU ID using the
>> bpf_get_smp_processor_id() helper.
>>
>> Additionally, by reusing global per-CPU variables, sharing information
>> between tail callers and callees or freplace callers and callees becomes
>> simpler compared to using percpu_array maps.
>>
>> Links:
>> [0] https://github.com/torvalds/linux/blob/fbfd64d25c7af3b8695201ebc85efe90be28c5a3/include/linux/percpu-defs.h#L114
>> [1] https://github.com/anakryiko/retsnoop
>> [2] https://github.com/Asphaltt/bpflbr
>>
>> Signed-off-by: Leon Hwang <leon.hwang@xxxxxxxxx>
>> ---
>>  kernel/bpf/arraymap.c  |  39 +++++++++++++-
>>  kernel/bpf/verifier.c  |  45 +++++++++++++++++
>>  tools/lib/bpf/libbpf.c | 112 ++++++++++++++++++++++++++++++++---------
>>  3 files changed, 171 insertions(+), 25 deletions(-)
>>
> 
> So I think the feature overall makes sense, but we need to think
> through at least libbpf's side of things some more. Unlike .data,
> per-cpu .data section is not mmapable, and so that has implication on
> BPF skeleton and we should make sure all that makes sense on BPF
> skeleton side. In that sense, per-cpu global data is more akin to
> struct_ops initialization image, which can be accessed by user before
> skeleton is loaded to initialize the image.
> 
> There are a few things to consider. What's the BPF skeleton interface?
> Do we expose it as single struct and use that struct as initial image
> for each CPU (which means user won't be able to initialize different
> CPU data differently, at least not through BPF skeleton facilities)?
> Or do we expose this as an array of structs and let user set each CPU
> data independently?
> 
> I feel like keeping it simple and having one image for all CPUs would
> cover most cases. And users can still access the underlying
> PERCPU_ARRAY map if they need more control.

Agree. It is necessary to keep it simple.

> 
> But either way, we need tests for skeleton, making sure we NULL-out
> this per-cpu global data, but take it into account before the load.
> 
> Also, this huge calloc for possible CPUs, I'd like to avoid it
> altogether for the (probably very common) zero-initialized case.

Ack.

> 
> So in short, needs a bit of iteration to figure out all the
> interfacing issues, but makes sense overall. See some more low-level
> remarks below.
> 

It is challenging to figure out them. I'll do my best to achieve it.

> pw-bot: cr
> 
> 
> [...]
> 
>> @@ -815,6 +850,8 @@ const struct bpf_map_ops percpu_array_map_ops = {
>>         .map_get_next_key = array_map_get_next_key,
>>         .map_lookup_elem = percpu_array_map_lookup_elem,
>>         .map_gen_lookup = percpu_array_map_gen_lookup,
>> +       .map_direct_value_addr = percpu_array_map_direct_value_addr,
>> +       .map_direct_value_meta = percpu_array_map_direct_value_meta,
>>         .map_update_elem = array_map_update_elem,
>>         .map_delete_elem = array_map_delete_elem,
>>         .map_lookup_percpu_elem = percpu_array_map_lookup_percpu_elem,
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index b8ca227c78af1..94ce02a48ddc1 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -6809,6 +6809,8 @@ static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val,
>>         u64 addr;
>>         int err;
>>
>> +       if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
>> +               return -EINVAL;
> 
> I'd invert the condition and reject anything by BPF_MAP_TYPE_ARRAY. I
> don't see how any other possible map type can support this magic that
> we do with ARRAY. So to be on the safe side, let's just reject
> everything that's non-ARRAY (here and elsewhere)?

Ack.

> 
>>         err = map->ops->map_direct_value_addr(map, &addr, off);
>>         if (err)
>>                 return err;
> 
> [...]
> 
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 6c262d0152f81..881174f4f90a4 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
> 
> definitely let's split out libbpf changes from kernel-side changes

Ack.

> 
>> @@ -516,6 +516,7 @@ struct bpf_struct_ops {
>>  };
>>
>>  #define DATA_SEC ".data"
>> +#define PERCPU_DATA_SEC ".data..percpu"
> 
> I don't like this prefix, even if that's what we have in the kernel.
> Something like just ".percpu" or ".percpu_data" or ".data_percpu" is
> better, IMO.

I tested ".percpu". It is OK to use it. But we have to update "bpftool
gen" too, which relies on these section names.

Is it better to keep ".data" prefix, like ".data.percpu", ".data_percpu"?
Can keeping ".data" prefix reduce some works on bpftool, go-ebpf and
akin bpf loaders?

> 
>>  #define BSS_SEC ".bss"
>>  #define RODATA_SEC ".rodata"
>>  #define KCONFIG_SEC ".kconfig"
>> @@ -562,6 +563,8 @@ struct bpf_map {
>>         __u32 btf_value_type_id;
>>         __u32 btf_vmlinux_value_type_id;
>>         enum libbpf_map_type libbpf_type;
>> +       int num_cpus;
>> +       void *data;
>>         void *mmaped;
>>         struct bpf_struct_ops *st_ops;
>>         struct bpf_map *inner_map;
>> @@ -1923,11 +1926,35 @@ static bool map_is_mmapable(struct bpf_object *obj, struct bpf_map *map)
>>         return false;
>>  }
>>
>> +static bool map_is_percpu_data(struct bpf_map *map)
>> +{
>> +       return str_has_pfx(map->real_name, PERCPU_DATA_SEC);
>> +}
> 
> we have enum libbpf_map_type which is used to distinguish BSS vs
> RODATA vs DATA vs others, let's stick to that
> 

Ack.

>> +
>> +static void map_copy_data(struct bpf_map *map, const void *data)
>> +{
>> +       bool is_percpu_data = map_is_percpu_data(map);
>> +       size_t data_sz = map->def.value_size;
>> +       size_t elem_sz = roundup(data_sz, 8);
>> +       int i;
>> +
>> +       if (!data)
>> +               return;
>> +
>> +       if (!is_percpu_data)
>> +               memcpy(map->mmaped, data, data_sz);
>> +       else
>> +               for (i = 0; i < map->num_cpus; i++)
>> +                       memcpy(map->data + i*elem_sz, data, data_sz);
>> +}
>> +
>>  static int
>>  bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>>                               const char *real_name, int sec_idx, void *data, size_t data_sz)
>>  {
>> +       bool is_percpu_data = str_has_pfx(real_name, PERCPU_DATA_SEC);
>>         struct bpf_map_def *def;
>> +       const char *data_desc;
>>         struct bpf_map *map;
>>         size_t mmap_sz;
>>         int err;
>> @@ -1948,7 +1975,8 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>>         }
>>
>>         def = &map->def;
>> -       def->type = BPF_MAP_TYPE_ARRAY;
>> +       def->type = is_percpu_data ? BPF_MAP_TYPE_PERCPU_ARRAY
>> +                                  : BPF_MAP_TYPE_ARRAY;
>>         def->key_size = sizeof(int);
>>         def->value_size = data_sz;
>>         def->max_entries = 1;
>> @@ -1958,29 +1986,57 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
>>         /* failures are fine because of maps like .rodata.str1.1 */
>>         (void) map_fill_btf_type_info(obj, map);
>>
>> -       if (map_is_mmapable(obj, map))
>> -               def->map_flags |= BPF_F_MMAPABLE;
>> +       data_desc = is_percpu_data ? "percpu " : "";
> 
> nit: just inline the logic in that single place you use it
> 

Ack.

>> +       pr_debug("map '%s' (global %sdata): at sec_idx %d, offset %zu, flags %x.\n",
>> +                map->name, data_desc, map->sec_idx, map->sec_offset,
>> +                def->map_flags);
>>
>> -       pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
>> -                map->name, map->sec_idx, map->sec_offset, def->map_flags);
>> +       if (is_percpu_data) {
>> +               map->num_cpus = libbpf_num_possible_cpus();
>> +               if (map->num_cpus < 0) {
>> +                       err = errno;
> 
> map->num_cpus should have actual error value already, avoid using
> errno unnecessarily
> 

Ack.

>> +                       pr_warn("failed to get possible cpus\n");
>> +                       goto free_name;
>> +               }
>>
>> -       mmap_sz = bpf_map_mmap_sz(map);
>> -       map->mmaped = mmap(NULL, mmap_sz, PROT_READ | PROT_WRITE,
>> -                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>> -       if (map->mmaped == MAP_FAILED) {
>> -               err = -errno;
>> -               map->mmaped = NULL;
>> -               pr_warn("failed to alloc map '%s' content buffer: %s\n", map->name, errstr(err));
>> -               zfree(&map->real_name);
>> -               zfree(&map->name);
>> -               return err;
>> -       }
>> +               map->data = calloc(map->num_cpus, roundup(data_sz, 8));
>> +               if (!map->data) {
>> +                       err = -ENOMEM;
>> +                       pr_warn("failed to alloc percpu map '%s' content buffer: %s\n",
>> +                               map->name, errstr(err));
> 
> please stick to establish reporting style (i.e., for map operations we
> always use "map '%s': " prefix)
> 

Ack.

>> +                       goto free_name;
>> +               }
> 
> [...]
> 
>> @@ -10370,7 +10434,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
>>                 return map->st_ops->data;
>>         }
>>
>> -       if (!map->mmaped)
>> +       if ((!map->mmaped && !map->data))
> 
> nit: why unnecessary extra () ?
> 

My bad. I'll remove it.

> 
>>                 return NULL;
>>
>>         if (map->def.type == BPF_MAP_TYPE_ARENA)
>> @@ -10378,7 +10442,7 @@ void *bpf_map__initial_value(const struct bpf_map *map, size_t *psize)
>>         else
>>                 *psize = map->def.value_size;
>>
>> -       return map->mmaped;
>> +       return map->def.type == BPF_MAP_TYPE_PERCPU_ARRAY ? map->data : map->mmaped;
>>  }
>>
>>  bool bpf_map__is_internal(const struct bpf_map *map)
>> --
>> 2.47.1
>>

Thanks,
Leon





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux