Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability

Yonghong Song <yhs@xxxxxx> · Wed, 3 Nov 2021 10:29:28 -0700

On 11/2/21 5:12 PM, Alexei Starovoitov wrote:
On Tue, Nov 02, 2021 at 02:14:29AM +0000, Joe Burton wrote:
From: Joe Burton <jevburton@xxxxxxxxxx>

This is the third version of a patch series implementing map tracing.

Map tracing enables executing BPF programs upon BPF map updates. This
might be useful to perform upgrades of stateful programs; e.g., tracing
programs can propagate changes to maps that occur during an upgrade
operation.

This version uses trampoline hooks to provide the capability.
fentry/fexit/fmod_ret programs can attach to two new functions:
         int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
                 void* val, u32 flags);
         int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);

These hooks work as intended for the following map types:
         BPF_MAP_TYPE_ARRAY
         BPF_MAP_TYPE_PERCPU_ARRAY
         BPF_MAP_TYPE_HASH
         BPF_MAP_TYPE_PERCPU_HASH
         BPF_MAP_TYPE_LRU_HASH
         BPF_MAP_TYPE_LRU_PERCPU_HASH

The only guarantee about the semantics of these hooks is that they execute
before the operation takes place. We cannot call them with locks held
because the hooked program might try to acquire the same locks. Thus they
may be invoked in situations where the traced map is not ultimately
updated.

The original proposal suggested exposing a function for each
(map type) x (access type). The problem I encountered is that e.g.
percpu hashtables use a custom function for some access types
(htab_percpu_map_update_elem) but a common function for others
(htab_map_delete_elem). Thus a userspace application would have to
maintain a unique list of functions to attach to for each map type;
moreover, this list could change across kernel versions. Map tracing is
easier to use with fewer functions, at the cost of tracing programs
being triggered more times.

Good point about htab_percpu.
The patches look good to me.
Few minor bits:
- pls don't use #pragma once.
   There was a discussion not too long ago about it and the conclusion
   was that let's not use it.
   It slipped into few selftest/bpf, but let's not introduce more users.
- noinline is not needed in prototype.
- bpf_probe_read is deprecated. Pls use bpf_probe_read_kernel.

and thanks for detailed patch 3.

To prevent the compiler from optimizing out the calls to my tracing
functions, I use the asm("") trick described in gcc's
__attribute__((noinline)) documentation. Experimentally, this trick
works with clang as well.

I think noinline is enough. I don't think you need that asm in there.

I tried a simple program using clang lto and the optimization 
(optimizing away the call itself) doesn't happen.

[$ ~/tmp2] cat t1.c 

__attribute__((noinline)) int foo() { 

  return 0; 

} 

[$ ~/tmp2] cat t2.c 

extern int foo(); 

int main() { 

  return foo(); 

} 

[$ ~/tmp2] cat run.sh 

clang -flto=full -O2 t1.c t2.c -c 

clang -flto=full -fuse-ld=lld -O2 t1.o t2.o -o a.out 

[$ ~/tmp2] ./run.sh 

[$ ~/tmp2] llvm-objdump -d a.out
...
0000000000201750 <foo>:
  201750: 31 c0                         xorl    %eax, %eax
  201752: c3                            retq
  201753: cc                            int3
  201754: cc                            int3
  201755: cc                            int3
  201756: cc                            int3
  201757: cc                            int3
  201758: cc                            int3
  201759: cc                            int3
  20175a: cc                            int3
  20175b: cc                            int3
  20175c: cc                            int3
  20175d: cc                            int3
  20175e: cc                            int3
  20175f: cc                            int3

0000000000201760 <main>:
  201760: e9 eb ff ff ff                jmp     0x201750 <foo>

I remember that even if a call is marked as noinline, the compiler might 
still poke into the call to find some information for some optimization.
But I guess probably the callsite will be kept. Otherwise, it will be
considered as "inlining".

Joe, did you hit any issues, esp. with gcc lto?

In parallel let's figure out how to do:
SEC("fentry/bpf_map_trace_update_elem")
int BPF_PROG(copy_on_write__update,
              struct bpf_map *map,
              struct allow_reads_key__old *key,
              void *value, u64 map_flags)

It kinda sucks that bpf_probe_read_kernel is necessary to read key/values.
It would be much nicer to be able to specify the exact struct for the key and
access it directly.
The verifier does this already for map iterator.
It's 'void *' on the kernel side while iterator prog can cast this pointer
to specific 'struct key *' and access it directly.
See bpf_iter_reg->ctx_arg_info and btf_ctx_access().

For fentry into bpf_map_trace_update_elem it's a bit more challenging,
since it will be called for all maps and there is no way to statically
check that specific_map->key_size is within prog->aux->max_rdonly_access.

May be we can do a dynamic cast helper (simlar to those that cast sockets)
that will check for key_size at run-time?
Another alternative is to allow 'void *' -> PTR_TO_BTF_ID conversion
and let inlined probe_read do the job.