Re: Verifier - wild instructions count fluctuations between versions?

Shung-Hsi Yu <shung-hsi.yu@xxxxxxxx> · Tue, 22 Oct 2024 14:13:29 +0800

Hi,

Sorry for coming to this late. Replies are in-line/interleaved, so some
of my comments might be hidden by your email client.

On Mon, Sep 23, 2024 at 07:26:25PM GMT, Eduard Zingerman wrote:
> On Mon, 2024-09-23 at 19:35 +0100, Alasdair McWilliam wrote:
> > Hello,
> > 
> > First post so please be gentle :-)
> > 
> > I've got an eBPF workload running on kernel 6.1 LTS and we're running great.
> > 
> > Use case actually is using eBPF in combination with XDP and AF_XDP for
> > volumetric DDoS mitigation.
> > 
> > Makeup of the eBPF program is mostly packet parsing, LPM and map
> > lookups, and 2x calls to the bpf_loop() helper. Currently no iterators,
> > dynptrs, etc, but lots of switch-case blocks.
> > 
> > I've started to test newer kernel versions in preparation to upgrade our
> > stack from 6.1 LTS to 6.6 LTS to gain access to newer functionality and
> > just for future proofing. However, when loading the BPF object code on a
> > 6.6 kernel, the BPF verifier refuses to load the program that 6.1
> > accepts and runs well.
> > 
> > This caught me by surprise, because I have witnessed our stack boot
> > successfully on a 6.7 kernel. So, I've run veristat [0] on the exact
> > same eBPF object file, compiled by clang17, but each time running on a
> > different kernel version. Results fluctuate wildly!
> > 
> > Results on 6.1.106: success: 53687 insns and 5114 states [1]
> > Results on 6.6.52:  failure: 1000001 insns and 39501 states [2]
> > Results on 6.7.9:   success: 131418 insns and 8839 states [3]
> 
> Hi Alasdair,
> 
> It might be the case that your issues with bpf_loop() are triggered by
> the following commit:
> - "bpf: verify callbacks as if they are called unknown number of times":
>   - ab5cfac139ab for 6.7.y
>   - b43550d7d58e for 6.6.y
>   - not backported to 6.1.y
> 
> This commit is a correctness fix, w/o it bodies of the loop callbacks
> were not checked exhaustively. But side effect of this fix is
> significant verification time regression for some programs.
> 
> Comparing BPF related commits in both branches (starting from merge
> base, using script from the attachment) gives somewhat sporadic
> results:
> 
>   Commits stats:
>     only in stable/linux-6.6.y    : 50
>     only in stable/linux-6.7.y    : 96
>     common                        : 74
> 
>   Only in stable/linux-6.6.y:
>     ...
> 
>   Only in stable/linux-6.7.y:
>     ...
> Of these only "bpf: Improve JEQ/JNE branch taken logic" from 6.7
> looks like an optimization, however it did not show any changes in
> veristat data for selftests.

I've also tried to look at this using a different script based on an
in-house tool and came to roughly the same conclusion on the 6.7 side.
Nothing in 6.7 specifically stands out to me as something that would
explain the difference.
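
For anyone who wants to repeat this kind of comparison, here is a
minimal sketch. It is neither the script Eduard attached nor the
in-house tool, just an illustration under some assumptions: commits are
matched by subject line (stable backports get new hashes), only
kernel/bpf is considered, and the function names and the repository
path argument are made up for the example.

```python
import subprocess


def git_lines(repo, *args):
    """Run a git command in `repo` and return its non-empty output lines."""
    out = subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def bpf_subjects(repo, base, branch, paths=("kernel/bpf",)):
    """Subjects of non-merge commits in base..branch that touch `paths`."""
    return set(git_lines(
        repo, "log", "--no-merges", "--format=%s",
        f"{base}..{branch}", "--", *paths,
    ))


def compare_subjects(a, b):
    """Split two collections of subjects into only-in-a, only-in-b, common."""
    a, b = set(a), set(b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "common": sorted(a & b),
    }


def compare_branches(repo, branch_a, branch_b, paths=("kernel/bpf",)):
    """Compare BPF commit subjects of two branches since their merge base."""
    base = git_lines(repo, "merge-base", branch_a, branch_b)[0]
    return compare_subjects(
        bpf_subjects(repo, base, branch_a, paths),
        bpf_subjects(repo, base, branch_b, paths),
    )
```

Something like compare_branches("/path/to/linux", "stable/linux-6.6.y",
"stable/linux-6.7.y") would then give the three buckets. Matching by
subject misses backports whose subject was edited, so the counts are
only approximate.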

OTOH 6.7.9 is _missing_ a fix that was backported to 6.6.52 --
e9a8e5a587ca "bpf: check bpf_func_state->callback_depth when pruning
states". It was backported to 6.7.10, but 6.7.9 doesn't have it yet.

Since it prevents (improper) pruning, it could explain what we're seeing
here.
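
To spell out the version reasoning, a small sketch follows. The table
only encodes what is stated above (the callback_depth fix first appears
in 6.7.10); the names FIRST_FIXED and has_backport are made up for the
example, and series without a known first fixed release are reported as
unknown rather than guessed.

```python
def parse_version(version):
    """Turn a release string such as '6.7.9' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# First stable release per series that carries e9a8e5a587ca.
# Only the 6.7.y entry is taken from this mail; others are left out.
FIRST_FIXED = {(6, 7): "6.7.10"}


def has_backport(version, first_fixed):
    """Return True/False if known, or None when the series is not listed."""
    ver = parse_version(version)
    series = ver[:2]
    if series not in first_fixed:
        return None
    return ver >= parse_version(first_fixed[series])


if __name__ == "__main__":
    for v in ("6.7.9", "6.7.12"):
        print(v, has_backport(v, FIRST_FIXED))
```

With that table, 6.7.9 is reported as lacking the fix and 6.7.12 as
having it, which is why 6.7.12 is the more useful data point.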

@Alasdair could you give 6.7.12 a quick try (I suppose that would be
easier since you already tested 6.7.9) and see how it goes there?

Additionally, here's a v6.1.y branch containing the "bpf: verify
callbacks as if they are called unknown number of times" fix Eduard
mentioned,

  https://github.com/shunghsiyu/linux/tree/stable/linux-6.1.y-callback-fixes-w-subprog-precision-v1

that I plan to submit (though long overdue). If @Alasdair could also
test it out, that would be highly appreciated.

Let me know if there's anything that would make things easier.

Thanks,
Shung-Hsi

> => it's hard to say what's missing from 6.6 for your use-case.
> 
> Maybe let's discuss options for your program optimization
> with regards to verifier performance?
> 
> Thanks,
> Eduard
> 
> P.S. hope I did not mess up the script.