Re: 10x regression in processed instructions after recent verifier change

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Wed, 3 Apr 2024 13:01:20 -0700

On Wed, Apr 3, 2024 at 11:59 AM Brennan Vincent
<brennan.vincent@xxxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> I am working on Parca, a continuous CPU time profiler which works by
> doing DWARF-based stack unwinding in an eBPF program. In 6.8 series
> kernels, our unwinder can no longer be loaded, as it exceeds the
> verifier's limit of 1M instructions processed.
>
> I was eventually able to get it working, so this is not exactly a
> support request, but rather a report of an issue that might affect other
> people.
>
> The issue is 100% reproducible and bisects to
> 67420501e8681ae18f9f0ea0a69cd2f432100e70. Hacking the kernel a bit to
> increase the instructions processed limit, I found that it successfully
> verifies after ~3M instructions, whereas previously it took ~250k, so
> more than a 10x regression.
>
> In case it matters, I'm on arm64.
>
> I am not an expert on the internal details of the verifier, but I have a
> vague theory on what can be causing this behavior. Our unwinder has a
> few places where it needs to do binary search on an array, which look
> similar to this:
>
> #define MAX_UNWIND_INFO_BINARY_SEARCH_DEPTH 19
> #define MAX_UNWIND_TABLE_SIZE 250 * 1000
>
> static u64 find_offset_for_pc(stack_unwind_table_t *table, u64 pc, u64 left, u64 right) {
>     u64 found = BINARY_SEARCH_DEFAULT;
>
>     for (int i = 0; i < MAX_UNWIND_INFO_BINARY_SEARCH_DEPTH; i++) {
>         if (left >= right) {
>             LOG("\t.done");
>             return found;
>         }
>
>         u32 mid = (left + right) / 2;
>
>         // Appease the verifier.
>         if (mid < 0 || mid >= MAX_UNWIND_TABLE_SIZE) {
>             bump_unwind_error_should_never_happen();
>             return BINARY_SEARCH_SHOULD_NEVER_HAPPEN;
>         }
>
>         if (table->rows[mid].pc <= pc) {
>             found = mid;
>             left = mid + 1;
>         } else {
>             right = mid;
>         }
>     }
>     return BINARY_SEARCH_EXHAUSTED_ITERATIONS;
> }
>
> (For more examples see
> https://github.com/parca-dev/parca-agent/blob/main/bpf/unwinders/native.bpf.c
> .)
>
> Note that as I said, I'm not an expert on the verifier, so my theory
> could be totally wrong. Please assume I put qualifiers like "I
> think...", "I suspect...", etc. in the proper places in what
> follows.
>
> The commit in question made the verifier track the bounds of variables
> more precisely, which might allow it to verify more programs. This,
> however, has a downside which is that it causes an explosion in the
> state space. In previous versions of the kernel, we could just verify
> this snippet once because the verifier did not precisely track the
> bounds of `mid`. In the new version, we have different bounds for `mid`
> in the different branches of the main `if` statement, and the set of
> different bounds multiplies in size by 2 each time through the loop,
> causing this huge explosion.
>
> Now, if I intentionally confuse the verifier and prevent it from knowing
> the bounds of `mid`, e.g.:
>
> static u32 __attribute__((optnone)) scramble(u32 val) {
>         return val ^ 0xFFFFFFFF;
> }
>
> then, in the main binary search loop:
>
>         u32 mid = (left + right) / 2;
>         mid = scramble(scramble(mid));
>
> then the program verifies fine, as it did before the commit in question.

Yep, I think your analysis is correct, and your obfuscation workaround
is "sane" :)

Having said that, I think it would benefit your BPF programs from
thinking a bit more carefully about scaling your BPF verification
process by using various features and techniques that keep
verification complexity more bounded.

Specifically, I think you can benefit from making find_offset_for_pc a
global function, in which case it will be validated just once,
separately, and then not re-validated.

Also, if you can rely on that, using bpf_for() loops (based on
open-coded iterators) or bpf_loop() helper would also help minimize
verification complexity of your loops.

>
> I am not sure what, if anything, should be done about this in the kernel,
> as it seems there's a fundamental tradeoff between precision of the
> verifier and performance in terms of instructions processed.
>
> --
> Brennan (he/him)
> Staff Software Engineer | Polar Signals Inc.
> E: brennan.vincent@xxxxxxxxxxxxxxxx
> W: umanwizard.com
> TZ: America/New_York