Re: Unifying signed and unsigned min/max tracking

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 23 Oct 2023 09:50:59 -0700

On Mon, Oct 23, 2023 at 6:14 AM Shung-Hsi Yu <shung-hsi.yu@xxxxxxxx> wrote:
>
> Hi,
>
> CC those who had worked on bound tracking before for feedbacks, as well as
> Dave who works on PREVAIL (verifier used on Windows) and Paul who've written
> about PREVAIL[1], for whether there's existing knowledge on this topic.
>
> Here goes a long one...
>
> ---
>
> While looking at Andrii's patches that improves bounds logic (specifically
> patches [2][3]). I realize we may be able to unify umin/umax/smin/smax into
> just two u64. Not sure if this has already been discussed off-list or is
> being worked upon, but I can't find anything regarding this by searching
> within the BPF mailing list.
>
> For simplicity sake I'll focus on unsigned bounds for now. What we have
> right in the Linux Kernel now is essentially
>
>     struct bounds {
>         u64 umin;
>         u64 umax;
>     }
>
> We can visualize the above as a number line, using asterisk to denote the
> values between umin and umax.
>
>                  u64
>     |----------********************--|
>
> Say if we have umin = A, and umax = B (where B > 2^63). Representing the
> magnitude of umin and umax visually would look like this
>
>     <----------> A
>     |----------********************--|
>     <-----------------------------> B (larger than 2^63)
>
> Now if we go through a BPF_ADD operation and adds 2^(64 - 1) = 2^63,
> currently the verifier will detect that this addition overflows, and thus
> reset umin and umax to 0 and U64_MAX, respectively; blowing away existing
> knowledge.
>
>     |********************************|
>
> Had we used u65 (1-bit more than u64) and tracks the bound with u65_min and
> u65_max, the verifier would have captured the bound just fine. (This idea
> comes from the special case mentioned in Andrii's patch[3])
>
>                                     u65
>     <---------------> 2^63
>                     <----------> A
>     <--------------------------> u65_min = A + 2^63
>     |--------------------------********************------------------|
>     <---------------------------------------------> u65_max = B + 2^63
>
> Continue on this thought further, let's attempting to map this back to u64
> number lines (using two of them to fit everything in u65 range), it would
> look like
>
>                                     u65
>     |--------------------------********************------------------|
>                                vvvvvvvvvvvvvvvvvvvv
>     |--------------------------******|*************------------------|
>                    u64                              u64
>
> And would seems that we'd need two sets of u64 bounds to preserve our
> knowledge.
>
>     |--------------------------******| u64 bound #1
>     |**************------------------| u64 bound #2
>
> Or just _one_ set of u64 bound if we somehow are able to track the union of
> bound #1 and bound #2 at the same time
>
>     |--------------------------******| u64 bound #1
>   U |**************------------------| u64 bound #2
>      vvvvvvvvvvvvvv            vvvvvv  union on the above bounds
>     |**************------------******|
>
> However, this bound crosses the point between U64_MAX and 0, which is not
> semantically possible to represent with the umin/umax approach. It just
> makes no sense.
>
>     |**************------------******| union of bound #1 and bound #2
>
> The way around this is that we can slightly change how we track the bounds,
> and instead use
>
>     struct bounds {
>         u64 base; /* base = umin */
>         /* Maybe there's a better name other than "size" */
>         u64 size; /* size = umax - umin */
>     }
>
> Using this base + size approach, previous old bound would have looked like
>
>     <----------> base = A
>     |----------********************--|
>                <------------------> size = B - A
>
> Looking at the bounds this way means we can now capture the union of bound
> #1 and bound #2 above. Here it is again for reference
>
>     |**************------------******| union of bound #1 and bound #2
>
> Because registers are u64-sized, they wraps, and if we extend the u64 number
> line, it would look like this due to wrapping
>
>                    u64                         same u64 wrapped
>     |**************------------******|*************------------******|
>
> Which can be capture with the base + size semantic
>
>     <--------------------------> base = (u64) A + 2^63
>     |**************------------******|*************------------******|
>                                <------------------> size = B - A,
>                                                     doesn't change after add
>
> Or looking it with just a single u64 number line again
>
>     <--------------------------> base = (u64) A + 2^63
>     |**************------------******|
>     <-------------> base + size = (u64) (B + 2^32)
>
> This would mean that umin and umax is no longer readily available, we now
> have to detect whether base + size wraps to determin whether umin = 0 or
> base (and similar for umax). But the verifier already have the code to do
> that in the existing scalar_min_max_add(), so it can be done by reusing
> existing code.
>
> ---
>
> Side tracking slightly, a benefit of this base + size approach is that
> scalar_min_max_add() can be made even simpler:
>
>     scalar_min_max_add(struct bpf_reg_state *dst_reg,
>                                struct bpf_reg_state *src_reg)
>     {
>         /* This looks too simplistic to have worked */
>         dst_reg.base = dst_reg.base + src_reg.base;
>         dst_reg.size = dst_reg.size + src_reg.size;
>     }
>
> Say we now have another unsigned bound where umin = C and umax = D
>
>     <--------------------> C
>     |--------------------*********---|
>     <----------------------------> D
>
> If we want to track the bounds after adding two registers on with umin = A &
> umax = B, the other with umin = C and umin = D
>
>     <----------> A
>     |----------********************--|
>     <-----------------------------> B
>                      +
>     <--------------------> C
>     |--------------------*********---|
>     <----------------------------> D
>
> The results falls into the following u65 range
>
>     |--------------------*********---|-------------------------------|
>   + |----------********************--|-------------------------------|
>
>     |------------------------------**|**************************-----|
>
> This result can be tracked with base + size approach just fine. Where the
> base and size are as follow
>
>     <------------------------------> base = A + C
>     |------------------------------**|**************************-----|
>                                    <--------------------------->
>                                       size = (B - A) + (D - C)
>
> ---
>
> Now back to the topic of unification of signed and unsigned range. Using the
> union of bound #1 and bound #2 again as an example (size = B - A, and
> base = (u64) A + 2^63)
>
>     |**************------------******| union of bound #1 and bound #2
>
> And look at it's wrapped number line form again
>
>                    u64                         same u64 wrapped
>     <--------------------------> base
>     |**************------------******|*************------------******|
>                                <------------------> size
>
> Now add in the s64 range and align both u64 range and s64 at 0, we can see
> what previously was a bound that umin/umax cannot track is simply a valid
> smin/smax bound (idea drawn from patch [2]).
>
>                                      0
>     |**************------------******|*************------------******|
>                     |----------********************--|
>                                     s64
>
> The question now is be what is the "signed" base so we proceed to calculate
> the smin/smax. Note that base starts at 0, so for s64 the line that
> represents base doesn't start from the left-most location.
> (OTOH size stays the same, so we know it already)
>
>                                     s64
>                                      0
>                                <-----> signed base = ?
>                     |----------********************--|
>                                <------------------> size is the same
>
> If we put u64 range back into the picture again, we can see that the "signed
> base" was, in fact, just base casted into s64, so there's really no need for
> a "signed" base at all
>
>     <--------------------------> base
>     |**************------------******|
>                                      0
>                                <-----> signed base = (s64) base
>                     |----------********************--|
>
> Which shows base + size approach capture signed and unsigned bounds at the
> same time. Or at least its the best attempt I can make to show it.
>
> One way to look at this is that base + size is just a generalization of
> umin/umax, taking advantage of the fact that the similar underlying hardware
> is used both for the execution of BPF program and bound tracking.
>
> I wonder whether this is already being done elsewhere, e.g. by PREVAIL or
> some of static code analyzer, and I can just borrow the code from there
> (where license permits).

A slight alternative, but the same idea, that I had (though after
looking at reg_bounds_sync() I became less enthusiastic about this)
was to unify signed/unsigned ranges by allowing umin u64> umax. That
is, invalid range where umin is greater than umax would mean the wrap
around case (which is also basically smin/smax case when it covers
negative and positive parts of s64/s32 range).

Taking your diagram and annotating it a bit differently:

|**************------------******|
             umax        umin

It will make everything more tricky, but if someone is enthusiastic
enough to try it out and see if we can make this still understandable,
why not?

>
> The glaring questions left to address are:
> 1. Lots of talk with no code at all:
>      Will try to work on this early November and send some result as RFC. In
>      the meantime if someone is willing to give it a try I'll do my best to
>      help.
>
> 2. Whether the same trick applied to scalar_min_max_add() can be applied to
>    other arithmetic operations such as BPF_MUL or BPF_DIV:
>      Maybe not, but we should be able to keep on using most of the existing
>      bound inferring logic we have scalar_min_max_{mul,div}() since base +
>      size can be viewed as a generalization of umin/umax/smin/smax.
>
> 3. (Assuming this base + size approach works) how to integrate it into our
>    existing codebase:
>      I think we may need to refactor out code that touches
>      umin/umax/smin/smax and provide set-operation API where possible. (i.e.
>      like tnum's APIs)
>
> 4. Whether the verifier loss to ability to track certain range that comes
>    out of mixed u64 and s64 BPF operations, and this loss cause some BPF
>    program that passes the verfier to now be rejected.

Very well might be, I've seen some crazy combinations in my testing.
Good thing is that I'm adding a quite exhaustive tests that try all
different boundary conditions. If you check seeds values I used, most
of them are some sort of boundary for signed/unsigned 32/64 bit
numbers. Add to that abstract interpretation model checking, and you
should be able to validate your ideas pretty easily.

>
> 5. Probably more that I haven't think of, feel free to add or comments :)
>
>
> Shung-Hsi
>
> 1: https://pchaigno.github.io/ebpf/2023/09/06/prevail-understanding-the-windows-ebpf-verifier.html
> 2: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@xxxxxxxxxx/
> 3: https://lore.kernel.org/bpf/20231022205743.72352-4-andrii@xxxxxxxxxx/