Libbpf is a crucial part of the BPF ecosystem and defines the modern way to build and run BPF applications. As such, it's important that it is well-tested, reliable, and works seamlessly across multiple kernel versions. Until recently, the only testing performed was the BPF selftests, run manually by BPF maintainers against bleeding-edge kernel versions. As diligent as maintainers are, this setup is far from perfect: it requires a lot of manual work and can still miss regressions and bugs due to kernel and environment differences. Catching regressions on old kernels was especially problematic and has led to real problems in production at Facebook. This seemed like a problem that needed automation.

We took the idea of our internal VMTEST framework, which allows running application integration tests against a range of kernels to catch problems, and applied it to the open-source GitHub mirror [0] of libbpf. We built upon Omar Sandoval's <osandov@xxxxxx> initial implementation for his drgn tool [1] and adapted it to libbpf's needs. It saved many hours of tinkering with generic qemu/Linux image setup! Julia Kartseva <hex@xxxxxx> spent a lot of time and effort bringing this workflow to libbpf and making the process robust and maintainable.

Now, with each change to libbpf, we pull and compile the latest kernel and the latest BPF selftests, built against the libbpf patches under test. Next, a VM with that kernel boots and runs a battery of tests (test_progs, test_verifier, and test_maps), verifying that both libbpf and the kernel still work as expected. Further, to verify that libbpf didn't regress on older kernels, we download a set of older kernels and run the supported subset of tests against each of them (a simplified sketch of this flow is shown after the bug list below). This gives us confidence that no matter how bleeding-edge a libbpf version you use, it will still work fine across all kernels. Check out a typical Travis CI test run [2] to get a better idea. You can also see an annotated list [3] of blacklisted tests for older kernels.

# Why does this matter?

- It's all about confidence when making BPF changes and about maintaining user trust. Automated, repeatable testing on **every** change to libbpf is crucial for allowing BPF developers to move fast and iterate quickly, while ensuring there is no inadvertent breakage of BPF applications. The more libbpf is integrated into critical applications (systemd, iproute2, bpftool, BCC tools, as well as a multitude of internal apps across private companies), the more important this becomes.
- A well-tested and maintained libbpf GitHub mirror (as opposed to building from kernel sources) as a single source of truth is important for package maintainers to ensure consistent libbpf versioning across different Linux distributions. This results in a better user experience overall, and everyone wins from this consistency.
- This is also a good base for more general kernel testing, given that this setup exercises not just libbpf, but the kernel itself as well. With a bit more automation, it is possible to proactively apply upstream patches and test kernel changes, saving BPF maintainers tons of time and speeding up the patch review process.

In the short time we've had this running, this setup has already caught kernel, libbpf, and selftest bugs (and will undoubtedly catch more):

- BPF trampoline assembly bug [4];
- Kprobe tests triggering bug [5];
- Test cleanup crashes [6];
- Test flakiness [7];
- Quite a few libbpf-specific problems that we never got around to tracking explicitly...
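For illustration only, here is a minimal Python sketch of what such a multi-kernel VM test loop might look like. This is not the actual CI code (which lives under travis-ci/vmtest in the libbpf repo [3]); the kernel image paths, the qemu command line, the rootfs image name, and the blacklist file format are assumptions, and the real setup filters individual selftest subtests rather than whole test binaries.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a multi-kernel VM test loop, loosely mirroring the
# CI flow described above. Paths, kernel versions, and qemu flags are
# illustrative assumptions, not the actual libbpf CI scripts.
import subprocess
import sys

# Selftest binaries built from the latest kernel sources against the
# libbpf patches under test.
TEST_BINARIES = ["test_progs", "test_verifier", "test_maps"]

# Kernels to boot: the freshly built bleeding-edge kernel plus a set of
# prebuilt older kernels (versions here are just examples).
KERNELS = ["latest/vmlinuz", "5.5.0/vmlinuz", "4.9.0/vmlinuz"]

def read_blacklist(path):
    """Read a per-kernel list of tests known not to work on that kernel.
    Assumed format: one test name per line, '#' starts a comment."""
    try:
        with open(path) as f:
            return {line.split("#")[0].strip()
                    for line in f if line.split("#")[0].strip()}
    except FileNotFoundError:
        return set()

def run_vm(kernel, image, command):
    """Boot a VM with the given kernel and rootfs image and run `command`
    inside it. Real CI collects results from the VM console; returning
    qemu's exit code here is a simplification."""
    return subprocess.run([
        "qemu-system-x86_64", "-nographic", "-enable-kvm",
        "-m", "2G", "-kernel", kernel,
        "-drive", f"file={image},format=raw",
        "-append", f"root=/dev/sda rw console=ttyS0 init={command}",
    ]).returncode

def main():
    failed = False
    for kernel in KERNELS:
        version = kernel.split("/")[0]
        blacklist = read_blacklist(f"configs/blacklist/BLACKLIST-{version}")
        for test in TEST_BINARIES:
            if test in blacklist:
                continue  # skip tests that need features this kernel lacks
            if run_vm(kernel, "rootfs.img", f"/selftests/bpf/{test}") != 0:
                print(f"FAIL: {test} on kernel {version}", file=sys.stderr)
                failed = True
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()
```

The key design point the sketch tries to capture is the per-kernel blacklist: newer selftests keep running unchanged on old kernels, and only the tests that genuinely require newer kernel features are skipped, so a libbpf regression on an old kernel still shows up as a test failure.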
[0] https://github.com/libbpf/libbpf
[1] https://github.com/osandov/drgn
[2] https://travis-ci.org/github/libbpf/libbpf/builds/663674948
[3] https://github.com/libbpf/libbpf/blob/master/travis-ci/vmtest/configs/blacklist/BLACKLIST-5.5.0
[4] https://lore.kernel.org/netdev/20200311003906.3643037-1-ast@xxxxxxxxxx/
[5] https://patchwork.ozlabs.org/patch/1254743/
[6] https://lore.kernel.org/netdev/20200220230546.769250-1-andriin@xxxxxx/
[7] https://lore.kernel.org/bpf/20200314024855.ugbvrmqkfq7kao75@xxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#ma733d8e9840d9f91ce20d1143a429aa0d6650959