Re: 10x regression in processed instructions after recent verifier change

"Brennan Vincent" <brennan.vincent@xxxxxxxxxxxxxxxx> · Wed, 03 Apr 2024 16:09:26 -0400

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:

> On Wed, Apr 3, 2024 at 11:59 AM Brennan Vincent
> <brennan.vincent@xxxxxxxxxxxxxxxx> wrote:
>>
>> Hello,
>>
>> I am working on Parca, a continuous CPU time profiler which works by
>> doing DWARF-based stack unwinding in an eBPF program. In 6.8 series
>> kernels, our unwinder can no longer be loaded, as it exceeds the
>> verifier's limit of 1M instructions processed.
>>
>> I was eventually able to get it working, so this is not exactly a
>> support request, but rather a report of an issue that might affect other
>> people.
>>
>> The issue is 100% reproducible and bisects to
>> 67420501e8681ae18f9f0ea0a69cd2f432100e70. Hacking the kernel a bit to
>> increase the instructions processed limit, I found that it successfully
>> verifies after ~3M instructions, whereas previously it took ~250k, so
>> more than a 10x regression.
>>
>> In case it matters, I'm on arm64.
>>
>> I am not an expert on the internal details of the verifier, but I have a
>> vague theory on what can be causing this behavior. Our unwinder has a
>> few places where it needs to do binary search on an array, which look
>> similar to this:
>>
>> #define MAX_UNWIND_INFO_BINARY_SEARCH_DEPTH 19
>> #define MAX_UNWIND_TABLE_SIZE 250 * 1000
>>
>> static u64 find_offset_for_pc(stack_unwind_table_t *table, u64 pc, u64 left, u64 right) {
>>     u64 found = BINARY_SEARCH_DEFAULT;
>>
>>     for (int i = 0; i < MAX_UNWIND_INFO_BINARY_SEARCH_DEPTH; i++) {
>>         if (left >= right) {
>>             LOG("\t.done");
>>             return found;
>>         }
>>
>>         u32 mid = (left + right) / 2;
>>
>>         // Appease the verifier.
>>         if (mid < 0 || mid >= MAX_UNWIND_TABLE_SIZE) {
>>             bump_unwind_error_should_never_happen();
>>             return BINARY_SEARCH_SHOULD_NEVER_HAPPEN;
>>         }
>>
>>         if (table->rows[mid].pc <= pc) {
>>             found = mid;
>>             left = mid + 1;
>>         } else {
>>             right = mid;
>>         }
>>     }
>>     return BINARY_SEARCH_EXHAUSTED_ITERATIONS;
>> }
>>
>> (For more examples see
>> https://github.com/parca-dev/parca-agent/blob/main/bpf/unwinders/native.bpf.c
>> .)
>>
>> Note that as I said, I'm not an expert on the verifier, so my theory
>> could be totally wrong. Please assume I put qualifiers like "I
>> think...", "I suspect...", etc. in the proper places in what
>> follows.
>>
>> The commit in question made the verifier track the bounds of variables
>> more precisely, which might allow it to verify more programs. This,
>> however, has a downside which is that it causes an explosion in the
>> state space. In previous versions of the kernel, we could just verify
>> this snippet once because the verifier did not precisely track the
>> bounds of `mid`. In the new version, we have different bounds for `mid`
>> in the different branches of the main `if` statement, and the set of
>> different bounds multiplies in size by 2 each time through the loop,
>> causing this huge explosion.
>>
>> Now, if I intentionally confuse the verifier and prevent it from knowing
>> the bounds of `mid`, e.g.:
>>
>> static u32 __attribute__((optnone)) scramble(u32 val) {
>>         return val ^ 0xFFFFFFFF;
>> }
>>
>> then, in the main binary search loop:
>>
>>         u32 mid = (left + right) / 2;
>>         mid = scramble(scramble(mid));
>>
>> then the program verifies fine, as it did before the commit in question.
>
> Yep, I think your analysis is correct, and your obfuscation workaround
> is "sane" :)
>
> Having said that, I think it would benefit your BPF programs from
> thinking a bit more carefully about scaling your BPF verification
> process by using various features and techniques that keep
> verification complexity more bounded.
>
> Specifically, I think you can benefit from making find_offset_for_pc a
> global function, in which case it will be validated just once,
> separately, and then not re-validated.
>
> Also, if you can rely on that, using bpf_for() loops (based on
> open-coded iterators) or bpf_loop() helper would also help minimize
> verification complexity of your loops.

Thanks. I'll look into all those suggestions.

>>
>> I am not sure what, if anything, should be done about this in the kernel,
>> as it seems there's a fundamental tradeoff between precision of the
>> verifier and performance in terms of instructions processed.
>>
>> --
>> Brennan (he/him)
>> Staff Software Engineer | Polar Signals Inc.
>> E: brennan.vincent@xxxxxxxxxxxxxxxx
>> W: umanwizard.com
>> TZ: America/New_York
-- 
Brennan (he/him)
Staff Software Engineer | Polar Signals Inc.
E: brennan.vincent@xxxxxxxxxxxxxxxx
W: www.umanwizard.com
TZ: America/New_York