Re: [5.10, 5.15] New bpf kselftest failure

Eduard Zingerman <eddyz87@xxxxxxxxx> · Tue, 18 Jul 2023 01:57:10 +0300

On Mon, 2023-07-17 at 10:59 -0400, Luiz Capitulino wrote:
> 
> On 2023-07-17 10:55, Eduard Zingerman wrote:
> 
> > 
> > 
> > 
> > On Mon, 2023-07-17 at 09:04 -0400, Luiz Capitulino wrote:
> > > Hi,
> > > 
> > > The upstream commit below is backported to 5.10.186, 5.15.120 and 6.1.36:
> > > 
> > > """
> > > commit ecdf985d7615356b78241fdb159c091830ed0380
> > > Author: Eduard Zingerman <eddyz87@xxxxxxxxx>
> > > Date:   Wed Feb 15 01:20:27 2023 +0200
> > > 
> > >       bpf: track immediate values written to stack by BPF_ST instruction
> > > """
> > > 
> > > This commit is causing the following bpf:test_verifier kselftest to fail:
> > > 
> > > """
> > > # #760/p precise: ST insn causing spi > allocated_stack FAIL
> > > """
> > > 
> > 
> > I can reproduce the error on 6.1.36 but don't understand what's causing it yet.
> > The log is suspiciously different from master, will comment later today.
> 
> Thank you very much for the prompt reply, Eduard.
> 
> I'm available for further testing if needed.

The test case in question looks as follows:

                           ; old log                 new log
0: R1=ctx(off=0,imm=0)
0: r3 = r10                ; R3=fp0 R10=fp0          R3=fp0 R10=fp0
1: if r3 != 0x7b goto pc+0 ; R3=fp0                  R3=fp0
2: *(u64 *)(r3 -8) = 0     ; R3=fp0 fp-8_w=mmmmmmmm  R3=fp0 fp-8_w=00000000
3: r4 = *(u64 *)(r10 -8)   ; R4_w=scalar() R10=fp0   R4_w=P0 R10=fp0 fp-8_w=00000000
4: r0 = -1                 ; R0=-1                   R0=-1
5: if r4 > r0 goto pc+0
6: exit

The test checks the log generated by __mark_chain_precision() at
instruction #5: registers `r4` and `r0` should be marked precise
(in that order).

The `r4` and `r0` are marked precise by check_cond_jmp_op() function
because jump #5 is predicted (because comparison is unsigned `r4` is
never greater than -1/`r0`).

Failure happens because only log for `r0` markings is present.

Patch [5] changes processing of instruction #2 ( *(u64 *)(r3 -8) = 0; ):
- before patch fp[-8] is marked as mmmmmmmm (any 64-bit scalar value);
- after patch fp[-8] is marked as 00000000.

Which in turn changes processing of instruction #3 ( r4 = *(u64 *)(r10 -8); ):
- before patch r4 is marked as unbound scalar
- after patch r4 is marked as precise zero
  (check_stack_read_fixed_off() has a special case for zero loads,
   DST registers of such loads are always marked precise).

If __mark_chain_precision() is called for a register that is already
marked as precise no log is generated. Thus the log looks different
when check_cond_jmp_op() calls __mark_chain_precision() for `r4` and
then for `r0`:
- before patch `r4` is not precise and log for it is generated;
- after patch `r4` is precise and log for it is not generated.

Now a question:
  why does current master has log for both `r4` and `r0`,
  given that commit [5] is present in master?

This happens because of commit [2] (not back-ported to 6.1.36).

Commit [2] adds function mark_all_scalars_imprecise() that removes
precision marks from scalar registers when new state is created in a
state chain (and for this test new states are created before #1 and #5
because of the BPF_F_TEST_STATE_FREQ flag).

Effectively, before #5 is processed by check_cond_jmp_op()
mark_all_scalars_imprecise() undoes `r4`'s precision mark assigned by
check_stack_read_fixed_off() => log for both `r4` and `r0` is present.

I need some time to convince myself that such interaction between
check_stack_read_fixed_off() and mark_all_scalars_imprecise()
is safe, but that's what we have in master.

Commit [2] is a part of a series [4] by Andrii Nakryiko.
This series had been partially back-ported already.
Commits [0,1,2,3] are not yet back-ported and have to be picked
altogether, otherwise there are numerous test failures.

Still, when I cherry-pick [0,1,2,3] `./test_progs -a setget_sockopt` is failing.
I'll investigate this failure but don't think I'll finish today.

---

Alternatively, if the goal is to minimize amount of changes, we can
disable or modify the 'precise: ST insn causing spi > allocated_stack'.

---

Commits (in chronological order):
[0] be2ef8161572 ("bpf: allow precision tracking for programs with subprogs")
[1] f63181b6ae79 ("bpf: stop setting precise in current state")
[2] 7a830b53c17b ("bpf: aggressively forget precise markings during state checkpointing")
[3] 4f999b767769 ("selftests/bpf: make test_align selftest more robust")
[4] 07d90c72efbe ("Merge branch 'BPF verifier precision tracking improvements'")
[5] ecdf985d7615 ("bpf: track immediate values written to stack by BPF_ST instruction")

> 
> - Luiz
> 
> > 
> > > Since this test didn't fail before ecdf985d76 backport, the question is
> > > if this is a test bug or if this commit introduced a regression.
> > > 
> > > I haven't checked if this failure is present in latest Linus tree because
> > > I was unable to build & run the bpf kselftests in an older distro.
> > > 
> > > Also, there some important details about running the bpf kselftests
> > > in 5.10 and 5.15:
> > > 
> > > * On 5.10, bpf kselftest build is broken. The following upstream
> > > commit needs to be cherry-picked for it to build & run:
> > > 
> > > """
> > > commit 4237e9f4a96228ccc8a7abe5e4b30834323cd353
> > > Author: Gilad Reti <gilad.reti@xxxxxxxxx>
> > > Date:   Wed Jan 13 07:38:08 2021 +0200
> > > 
> > >       selftests/bpf: Add verifier test for PTR_TO_MEM spill
> > > """
> > > 
> > > * On 5.15.120 there's one additional test that's failing, but I didn't
> > > debug this one:
> > > 
> > > """
> > > #150/p calls: trigger reg2btf_ids[reg→type] for reg→type > __BPF_REG_TYPE_MAX FAIL
> > > FAIL
> > > """
> > > 
> > > * On 5.11 onwards, building and running bpf tests is disabled by
> > > default by commit 7a6eb7c34a78498742b5f82543b7a68c1c443329, so I wonder
> > > if we want to backport this to 5.10 as well?
> > > 
> > > Thanks!
> > > 
> > > - Luiz
> > > 
> >