Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability

Joe Burton <jevburton.kernel@xxxxxxxxx> · Wed, 3 Nov 2021 10:45:12 -0700

Sort of - I hit issues when defining the function in the same
compilation unit as the call site. For example:

  static noinline int bpf_array_map_trace_update(struct bpf_map *map,
                void *key, void *value, u64 map_flags)
  {
        return 0;
  }
  ALLOW_ERROR_INJECTION(bpf_array_map_trace_update, ERRNO);

  /* Called from syscall or from eBPF program */
  static int array_map_update_elem(struct bpf_map *map, void *key,
                void *value, u64 map_flags)
  {
        ...
                /* This call is elided entirely! */
                if (unlikely(err = bpf_array_map_trace_update(map, key,
                                                value, map_flags)))
                        return err;

... I observed that the call was elided with both gcc and clang. In
either case, the function itself is left behind and can be attached to
with a trampoline prog, but the function is never invoked. Putting
the function body into its own compilation unit sidesteps the issue - I
suspect that LTO isn't clever enough to elide the call.

FWIW, I saw this in the objdump of `array_map_update_elem' when I compiled
this code:

  90:   e8 6b ff ff ff          call   0 \
      <bpf_array_map_trace_update.constprop.0>

So I suspect that constant propagation is responsible for getting rid of
the call site.

The gcc docs for __attribute__((noinline)) call out the asm("") trick to
avoid inlining:

        noinline
        This function attribute prevents a function from being
        considered for inlining. If the function does not have side-
        effects, there are optimizations other than inlining that
        causes function calls to be optimized away, although the
        function call is live. To keep such calls from being optimized
        away, put
                  asm ("");

Since putting the func in its own compilation unit sidesteps the issue,
I'm content to remove the asm("").

On Wed, Nov 3, 2021 at 10:29 AM Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 11/2/21 5:12 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 02, 2021 at 02:14:29AM +0000, Joe Burton wrote:
> >> From: Joe Burton <jevburton@xxxxxxxxxx>
> >>
> >> This is the third version of a patch series implementing map tracing.
> >>
> >> Map tracing enables executing BPF programs upon BPF map updates. This
> >> might be useful to perform upgrades of stateful programs; e.g., tracing
> >> programs can propagate changes to maps that occur during an upgrade
> >> operation.
> >>
> >> This version uses trampoline hooks to provide the capability.
> >> fentry/fexit/fmod_ret programs can attach to two new functions:
> >>          int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
> >>                  void* val, u32 flags);
> >>          int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
> >>
> >> These hooks work as intended for the following map types:
> >>          BPF_MAP_TYPE_ARRAY
> >>          BPF_MAP_TYPE_PERCPU_ARRAY
> >>          BPF_MAP_TYPE_HASH
> >>          BPF_MAP_TYPE_PERCPU_HASH
> >>          BPF_MAP_TYPE_LRU_HASH
> >>          BPF_MAP_TYPE_LRU_PERCPU_HASH
> >>
> >> The only guarantee about the semantics of these hooks is that they execute
> >> before the operation takes place. We cannot call them with locks held
> >> because the hooked program might try to acquire the same locks. Thus they
> >> may be invoked in situations where the traced map is not ultimately
> >> updated.
> >>
> >> The original proposal suggested exposing a function for each
> >> (map type) x (access type). The problem I encountered is that e.g.
> >> percpu hashtables use a custom function for some access types
> >> (htab_percpu_map_update_elem) but a common function for others
> >> (htab_map_delete_elem). Thus a userspace application would have to
> >> maintain a unique list of functions to attach to for each map type;
> >> moreover, this list could change across kernel versions. Map tracing is
> >> easier to use with fewer functions, at the cost of tracing programs
> >> being triggered more times.
> >
> > Good point about htab_percpu.
> > The patches look good to me.
> > Few minor bits:
> > - pls don't use #pragma once.
> >    There was a discussion not too long ago about it and the conclusion
> >    was that let's not use it.
> >    It slipped into few selftest/bpf, but let's not introduce more users.
> > - noinline is not needed in prototype.
> > - bpf_probe_read is deprecated. Pls use bpf_probe_read_kernel.
> >
> > and thanks for detailed patch 3.
> >
> >> To prevent the compiler from optimizing out the calls to my tracing
> >> functions, I use the asm("") trick described in gcc's
> >> __attribute__((noinline)) documentation. Experimentally, this trick
> >> works with clang as well.
> >
> > I think noinline is enough. I don't think you need that asm in there.
>
> I tried a simple program using clang lto and the optimization
> (optimizing away the call itself) doesn't happen.
>
> [$ ~/tmp2] cat t1.c
>
>
> __attribute__((noinline)) int foo() {
>
>
>    return 0;
>
>
> }
>
>
> [$ ~/tmp2] cat t2.c
>
>
> extern int foo();
>
>
> int main() {
>
>
>    return foo();
>
>
> }
>
>
> [$ ~/tmp2] cat run.sh
>
>
> clang -flto=full -O2 t1.c t2.c -c
>
>
> clang -flto=full -fuse-ld=lld -O2 t1.o t2.o -o a.out
>
>
> [$ ~/tmp2] ./run.sh
>
>
> [$ ~/tmp2] llvm-objdump -d a.out
> ...
> 0000000000201750 <foo>:
>    201750: 31 c0                         xorl    %eax, %eax
>    201752: c3                            retq
>    201753: cc                            int3
>    201754: cc                            int3
>    201755: cc                            int3
>    201756: cc                            int3
>    201757: cc                            int3
>    201758: cc                            int3
>    201759: cc                            int3
>    20175a: cc                            int3
>    20175b: cc                            int3
>    20175c: cc                            int3
>    20175d: cc                            int3
>    20175e: cc                            int3
>    20175f: cc                            int3
>
> 0000000000201760 <main>:
>    201760: e9 eb ff ff ff                jmp     0x201750 <foo>
>
> I remember that even if a call is marked as noinline, the compiler might
> still poke into the call to find some information for some optimization.
> But I guess probably the callsite will be kept. Otherwise, it will be
> considered as "inlining".
>
> Joe, did you hit any issues, esp. with gcc lto?
>
> >
> > In parallel let's figure out how to do:
> > SEC("fentry/bpf_map_trace_update_elem")
> > int BPF_PROG(copy_on_write__update,
> >               struct bpf_map *map,
> >               struct allow_reads_key__old *key,
> >               void *value, u64 map_flags)
> >
> > It kinda sucks that bpf_probe_read_kernel is necessary to read key/values.
> > It would be much nicer to be able to specify the exact struct for the key and
> > access it directly.
> > The verifier does this already for map iterator.
> > It's 'void *' on the kernel side while iterator prog can cast this pointer
> > to specific 'struct key *' and access it directly.
> > See bpf_iter_reg->ctx_arg_info and btf_ctx_access().
> >
> > For fentry into bpf_map_trace_update_elem it's a bit more challenging,
> > since it will be called for all maps and there is no way to statically
> > check that specific_map->key_size is within prog->aux->max_rdonly_access.
> >
> > May be we can do a dynamic cast helper (simlar to those that cast sockets)
> > that will check for key_size at run-time?
> > Another alternative is to allow 'void *' -> PTR_TO_BTF_ID conversion
> > and let inlined probe_read do the job.
> >