BTF generation and pruning (notes from office hours)

David Faust <david.faust@xxxxxxxxxx> · Thu, 25 Jan 2024 14:56:58 -0800

This morning in the BPF office hours we discussed BTF, starting from
some specific cases where gcc and clang differ, and ending up at the
broader question of what precisely should or should not be present
in generated BTF info and in what cases.

Below are summary notes on the discussion so far. Apologies if I've
forgotten anything.

Motivation: there are some cases where gcc emits more BTF information
than clang, in particular (not necessarily exhaustive):
  + clang does not emit BTF for unused static vars
  + clang does not emit BTF for variables which have been optimized
    away entirely
  + clang does not emit BTF for types which are only used by one
    of the above
  (See a couple of concrete examples at the bottom.)

One reason for this is implementation differences in the compiler.
- In clang, BTF is generated late, in the BPF backend, after most
  optimizations have happened.
- In gcc, BTF is currently generated similarly to DWARF. This means:
  + It reflects more closely the types/vars etc. in input source
  + It is earlier; many optimizations have not happened yet, so
    variables which eventually get optimized away are still present.

Another reason is concern about size. Clang deliberately does not add
some types or do pointer chasing in some cases, to avoid adding many
BTF records for types not immediately relevant to the program. The
obvious example is bpf_helpers.h or vmlinux.h: programs often need
just a few helpers and ignore the rest, but by including the headers
they end up pulling in thousands of types which they do not use.
- This also comes with a drawback: in some cases BTF will not be
  emitted when it is desired. There is a BTF_TYPE_EMIT macro to
  work around that, but it isn't a perfect solution.
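
A minimal sketch of how that workaround is typically used. The macro
definition is assumed to match the one in the kernel's
include/linux/btf.h (check your tree); struct hidden_state and
emit_hidden_state_type() are hypothetical names made up for
illustration. The idea is that the cast mentions the type from code
the compiler keeps, so there is a reason to emit a record for it.

```c
/* Assumed to match the kernel definition in include/linux/btf.h. */
#define BTF_TYPE_EMIT(type) ((void)(type *)0)

/* Hypothetical type that no surviving code references directly. */
struct hidden_state {
	int counter;
	long long stamp;
};

/* Hypothetical function: it exists only to mention the type, and
 * generates no code for the cast itself. Always returns 0. */
int emit_hidden_state_type(void)
{
	BTF_TYPE_EMIT(struct hidden_state);
	return 0;
}
```

The imperfection is visible here: every type that pruning would
otherwise drop has to be listed by hand somewhere in live code.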

So, the question is twofold:
1. What ought to be represented in BTF for a BPF program?
2. Is that followed, and should it be, for non-BPF program cases
   such as generating BTF for vmlinux?

Discussion / things that were generally agreed on:
- BTF for a BPF program should represent exactly what is in the
  final program; things like variables which are optimized away
  entirely should not be represented. Note that this differs from
  other debug formats like DWARF which more closely represent the
  original source.
  + In addition, things like static variables which are not used
    are not represented.

  Reasons:
  1. BTF for a BPF program is primarily of use to the BPF loader,
     so representing in BTF things which no longer exist in the
     actual BPF program is counter-productive.

  2. Size. BPF programs including bpf_helpers.h or vmlinux.h pull
     in many many types which are not used. Representing all
     those bloats the BTF significantly for no gain.

- BTF for vmlinux is currently similar, and aims to represent what
  is actually there. The end goal for BTF is to have everything
  needed for full visibility for tracing. Size of BTF is also a
  concern; there are many things which pahole omits, like global
  variables.

- BTF itself is not specific to BPF. gcc supports -gbtf for any
  target. So it does not make sense to always prune types as though
  generating BTF for a BPF program.

- There are also cases for BPF where it makes sense for the compiler
  to not try to be too clever about what to prune, and rather
  leave it up to something else. For example, if in the future
  BTF for the kernel is generated from the compiler and pahole
  is used to do BTF->BTF translation, it makes sense to have the
  compiler emit everything, and let pahole decide what to prune.

- We could add some sort of compiler flag, -fprune-btf or so,
  to control this behavior. Initially we thought of 3 levels,
  but narrowed it down to two being useful:
  0 - compiler does no additional pruning; BTF is closer to source
      (how gcc behaves now)
  1 - compiler does pruning as though for a BPF program; BTF
      represents only what is in the final program
      (how clang behaves now)
  (With only two levels, the flag just becomes an on/off switch
   to control the pruning step)

- For this flag, we need to have the precise criteria used in
  clang to determine what to prune. Probably this should also
  be documented somehow(?)
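
To make "pruning as though for a BPF program" concrete, here is a toy
model of one plausible criterion: keep only the records reachable from
things that survive into the final object. This is an assumption for
illustration, not clang's actual rule set (which, as noted above, still
needs to be pinned down). struct toy_type, toy_mark() and toy_prune()
are hypothetical names, and the layout does not reflect the real BTF
encoding.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_REFS 4

/* Toy stand-in for a BTF record; NOT the real struct btf_type. */
struct toy_type {
	const char *name;
	int refs[TOY_MAX_REFS];	/* ids of referenced types; 0 = none (void) */
	bool is_root;		/* survives into the final object on its own */
};

/* Mark type `id` and everything reachable from it.
 * Ids are 1-based, so types[id - 1] is the record and keep[] has
 * n + 1 entries. The keep[] check also terminates cycles. */
static void toy_mark(const struct toy_type *types, size_t n, int id,
		     bool *keep)
{
	if (id <= 0 || (size_t)id > n || keep[id])
		return;
	keep[id] = true;
	for (int i = 0; i < TOY_MAX_REFS; i++)
		toy_mark(types, n, types[id - 1].refs[i], keep);
}

/* Fill keep[0..n] and return how many records would be kept. */
size_t toy_prune(const struct toy_type *types, size_t n, bool *keep)
{
	size_t kept = 0;

	for (size_t i = 0; i <= n; i++)
		keep[i] = false;
	for (size_t i = 0; i < n; i++)
		if (types[i].is_root)
			toy_mark(types, n, (int)(i + 1), keep);
	for (size_t i = 1; i <= n; i++)
		kept += keep[i] ? 1 : 0;
	return kept;
}
```

Fed the 15 records from gcc's dump in example 1 at the bottom, with
only the 'license' DATASEC marked as a root, this keeps ids 4, 11, 12,
14 and 15: five records, the same number clang emits for that file.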

- LTO, the linker (as in ld), and BTF deduplication.
  + For DWARF, LTO is more complicated because of call site info.
  + For BTF right now: no LTO for BPF programs.
    Supposing the linker did BTF dedup, right now nothing additional
    would be needed for LTO.
  + If at some point BTF adds call site info, linker could simply
    discard BTF from the first compiler invocation and dedup BTF
    emitted by the second compiler invocation (assumes BTF emission
    in finish() rather than early_finish() for gcc).

- We had some discussion of how all this could affect/interact with
  things like split BTF for vmlinux, but I don't think we reached
  any conclusions. Input appreciated.

===========
examples discussed, for reference

1. BTF for unused static global variable and its types
$ cat reduced.c
typedef long long unsigned int __u64;

struct bpf_timer {
  __u64 __opaque[2];
} __attribute__((preserve_access_index));

static long (*bpf_timer_set_callback)(struct bpf_timer *timer, void *callback_fn) = (void *) 170;
char LICENSE[] __attribute__((section("license"), used)) = "GPL";

gcc:
$ ~/toolchains/bpf/bin/bpf-unknown-none-gcc -c -gbtf -O2 reduced.c -o reduced.o.gcc
$ /usr/sbin/bpftool btf dump file reduced.o.gcc
[1] INT 'long long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
[2] TYPEDEF '__u64' type_id=1
[3] STRUCT 'bpf_timer' size=16 vlen=1
	'__opaque' type_id=5 bits_offset=0
[4] INT 'long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
[5] ARRAY '(anon)' type_id=2 index_type_id=4 nr_elems=2
[6] INT 'long int' size=8 bits_offset=0 nr_bits=64 encoding=SIGNED
[7] FUNC_PROTO '(anon)' ret_type_id=6 vlen=2
	'(anon)' type_id=8
	'(anon)' type_id=9
[8] PTR '(anon)' type_id=3
[9] PTR '(anon)' type_id=0
[10] PTR '(anon)' type_id=7
[11] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
[12] ARRAY '(anon)' type_id=11 index_type_id=4 nr_elems=4
[13] VAR 'bpf_timer_set_callback' type_id=10, linkage=static
[14] VAR 'LICENSE' type_id=12, linkage=global
[15] DATASEC 'license' size=0 vlen=1
	type_id=14 offset=0 size=4 (VAR 'LICENSE')

clang:
$ ~/toolchains/llvm/bin/clang -target bpf -c -g -O2 reduced.c -o reduced.o.clang
$ /usr/sbin/bpftool btf dump file reduced.o.clang
[1] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
[2] ARRAY '(anon)' type_id=1 index_type_id=3 nr_elems=4
[3] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[4] VAR 'LICENSE' type_id=2, linkage=global
[5] DATASEC 'license' size=0 vlen=1
	type_id=4 offset=0 size=4 (VAR 'LICENSE')

Note how clang does not include any BTF info for bpf_timer_set_callback,
since it is a variable which is not used in the program. This elides
all the types used only by it as well.

===================

2. BTF for variable which is entirely optimized away
$ cat optvar.c
static int a = 5;

int foo (int x) {
	return a + x;
}

gcc:
$ ~/toolchains/bpf/bin/bpf-unknown-none-gcc -c -gbtf -O2 optvar.c -o optvar.o.gcc
$ /usr/sbin/bpftool btf dump file optvar.o.gcc
[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[2] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1
	'x' type_id=1
[3] VAR 'a' type_id=1, linkage=static
[4] FUNC 'foo' type_id=2 linkage=global

clang:
$ ~/toolchains/llvm/bin/clang -target bpf -c -g -O2 optvar.c -o optvar.o.clang
$ /usr/sbin/bpftool btf dump file optvar.o.clang
[1] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[2] FUNC_PROTO '(anon)' ret_type_id=1 vlen=1
	'x' type_id=1
[3] FUNC 'foo' type_id=2 linkage=global

Simple case: variable 'a' gets completely optimized away and is
replaced with the literal 5 where it is used. Clang does not include
a VAR record for it, but gcc does.