Re: [PATCH bpf] bpf: unconditionally reset backtrack_state masks on global func exit

On Tue, 2023-09-19 at 13:56 -0700, Andrii Nakryiko wrote:
> On Tue, Sep 19, 2023 at 11:59 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> > 
> > On Tue, Sep 19, 2023 at 2:06 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > 
> > > On Mon, Sep 18, 2023 at 2:01 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > > 
> > > > In mark_chain_precision() logic, when we reach the entry to a global
> > > > func, it is expected that R1-R5 might be still requested to be marked
> > > > precise. This would correspond to some integer input arguments being
> > > > tracked as precise. This is all expected and handled as a special case.
> > > > 
> > > > What's not expected is that we'll leave the backtrack_state structure
> > > > with some register bits set. This matters because backtrack_state is
> > > > reused for subsequent precision propagations without clearing masks:
> > > > for speed, all code paths are carefully written to leave
> > > > backtrack_state empty, with zeroed-out masks.
> > > > 
> > > > The fix is trivial: we always clear the register bit in the register mask,
> > > > and then, optionally, set reg->precise if the register is of SCALAR_VALUE type.
> > > > 
> > > > Reported-by: Chris Mason <clm@xxxxxxxx>
> > > > Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs")
> > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > > ---
> > > >  kernel/bpf/verifier.c | 8 +++-----
> > > >  1 file changed, 3 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > > index bb78212fa5b2..c0c7d137066a 100644
> > > > --- a/kernel/bpf/verifier.c
> > > > +++ b/kernel/bpf/verifier.c
> > > > @@ -4047,11 +4047,9 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno)
> > > >                                 bitmap_from_u64(mask, bt_reg_mask(bt));
> > > >                                 for_each_set_bit(i, mask, 32) {
> > > >                                         reg = &st->frame[0]->regs[i];
> > > > -                                       if (reg->type != SCALAR_VALUE) {
> > > > -                                               bt_clear_reg(bt, i);
> > > > -                                               continue;
> > > > -                                       }
> > > > -                                       reg->precise = true;
> > > > +                                       bt_clear_reg(bt, i);
> > > > +                                       if (reg->type == SCALAR_VALUE)
> > > > +                                               reg->precise = true;
> > > 
> > > Looks good, but is there a selftest that can demonstrate the issue?
> > 
> > I'll see if I can write something small and reliable.
> 
> I give up. It seems like lots of conditions have to come together to
> trigger this. In production it was an application that happened to
> finish global func validation with that r1 set as precise in
> backtrack_state, and then proceeded to have some equivalent state
> matched immediately, which triggered propagate_precision() ->
> mark_chain_precision_batch(), but propagating r9. With this bug we were
> then looking to propagate both r1 and r9, but the code path under
> verification didn't have any instruction touching r1 until we bubbled
> back up to the helper call instruction, where the verifier complained
> about r1 being required to be precise right after the helper call (which
> is illegal, as r1-r5 are clobbered).
> 
> The few simple tests I tried failed to set up all the conditions
> necessary to trigger this in the exact sequence. The fix is simple and
> well understood; I'd vote for landing it, given that crafting a test is
> highly non-trivial.

I agree with this change, and it does not cause any issues when tested locally.
As far as I understand, to make a test case one needs to:
- make a special async callback that would leave some garbage "r1-r5"
  bits set in frame zero;
- trick the verifier into checking the async callback first and then
  jumping to a state where precision marks on "r1-r5" are forbidden,
  e.g. a point right after a pseudo call, so that backtrack_insn would
  jump to BPF_EXIT and complain.
Crafting such a test case sounds tricky, and it might be fragile.

It appears to me that from time to time we get into situations where
having kunit tests would be beneficial ¯\_(ツ)_/¯.




