Re: Help with verifier failure

Brendan Jackman <jackmanb@xxxxxxxxxx> · Fri, 7 May 2021 17:05:53 +0200

On Thu, 22 Apr 2021 at 16:35, Yonghong Song <yhs@xxxxxx> wrote:
>
>
>
> On 4/22/21 6:55 AM, Brendan Jackman wrote:
> > On Wed, 21 Apr 2021 at 18:59, Yonghong Song <yhs@xxxxxx> wrote:
> >> On 4/21/21 8:06 AM, Yonghong Song wrote:
> >>> On 4/21/21 5:23 AM, Brendan Jackman wrote:
> >>> Thanks, Brendan. Looks at least the verifier failure is triggered
> >>> by recent clang changes. I will take a look whether we could
> >>> improve verifier for such a case and whether we could improve
> >>> clang to avoid generate such codes the verifier doesn't like.
> >>> Will get back to you once I had concrete analysis.
> >>>
> >>>>
> >>>> This seems like it must be a common pitfall, any idea what we can do
> >>>> to fix it
> >>>> and avoid it in future? Am I misunderstanding the issue?
> >>
> >> First, for the example code you provided, I checked with llvm11, llvm12
> >> and latest trunk llvm (llvm13-dev) and they all generated similar codes,
> >> which may trigger verifier failure. Somehow you original code could be
> >> different may only show up with a recent llvm, I guess.
> >>
> >> Checking llvm IR, the divergence between "w2 = w8" and "if r8 < 0x1000"
> >> appears in insn scheduling phase related handling PHIs. Need to further
> >> check whether it is possible to prevent the compiler from generating
> >> such codes.
> >>
> >> The latest kernel already had the ability to track register equivalence.
> >> However, the tracking is conservative for 32bit mov like "w2 = w8" as
> >> you described in the above. if we have code like "r2 = r8; if r8 <
> >> 0x1000 ...", we will be all good.
> >>
> >> The following hack fixed the issue,
> >>
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 58730872f7e5..54f418fd6a4a 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -7728,12 +7728,20 @@ static int check_alu_op(struct bpf_verifier_env
> >> *env, struct bpf_insn *insn)
> >>                                                   insn->src_reg);
> >>                                           return -EACCES;
> >>                                   } else if (src_reg->type == SCALAR_VALUE) {
> >> +                                       /* If src_reg is in 32bit range,
> >> there is
> >> +                                        * no need to reset the ID.
> >> +                                        */
> >> +                                       bool is_32bit_src =
> >> src_reg->umax_value <= 0x7fffffff;
> >> +
> >> +                                       if (is_32bit_src && !src_reg->id)
> >> +                                               src_reg->id = ++env->id_gen;
> >>                                           *dst_reg = *src_reg;
> >>                                           /* Make sure ID is cleared
> >> otherwise
> >>                                            * dst_reg min/max could be
> >> incorrectly
> >>                                            * propagated into src_reg by
> >> find_equal_scalars()
> >>                                            */
> >> -                                       dst_reg->id = 0;
> >> +                                       if (!is_32bit_src)
> >> +                                               dst_reg->id = 0;
> >>                                           dst_reg->live |= REG_LIVE_WRITTEN;
> >>                                           dst_reg->subreg_def =
> >> env->insn_idx + 1;
> >>                                   } else {
> >>
> >> Basically, for a 32bit mov insn like "w2 = w8", if we can ensure
> >> that "w8" is 32bit and has no possibility that upper 32bit is set
> >> for r8, we can declare them equivalent. This fixed your issue.

I just got around to looking into this - spent some time reading and
realised it's simpler than I thought :) I also double checked that it
fixes the test with my current Clang too.

Beyond cleaning up and putting it into a patch, did you have anything
in particular in mind when you called this a "hack"?

Do I understand correctly that in this code we only need to check
umax_value, because it anyway gets folded into the other bounds fields
during adjust_min_max_reg_vals?

It seems like the next rung on the "ladder" of solution completeness
here would be quite a big step up, something like a more comprehensive
representation of register relationships (instead of just "these regs
have the same value" vs. "these regs have no relationship"), which I
guess would be more extreme than necessary right now.