Re: [PATCH bpf-next v2 0/3] Add XDP support for bpf_load_hdr_opt

Yonghong Song <yhs@xxxxxx> · Tue, 19 Oct 2021 09:02:33 -0700

On 10/18/21 5:00 PM, Alexei Starovoitov wrote:
On Sat, Oct 09, 2021 at 12:20:27AM +0200, Toke Høiland-Jørgensen wrote:

So if we can't fix the verifier, maybe we could come up with a more
general helper for packet parsing? Something like:

bpf_for_each_pkt_chunk(ctx, offset, callback_fn, callback_arg)
{
   ptr = ctx->data + offset;
   while (ptr < ctx->data_end) {
     offset = callback_fn(ptr, ctx->data_end, callback_arg);
     if (offset == 0)
       return 0;
     ptr += offset;
   }

   // out of bounds before callback was done
   return -EINVAL;
}

We're starting to work on this since it will be useful not only for
packet parsing, TLV parsing, but potentially any kind of 'for' loop iteration.

This would work for parsing any kind of packet header or TLV-style data
without having to teach the kernel about each header type. It'll have
quite a bit of overhead if all the callbacks happen via indirect calls,
but maybe the verifier can inline the calls (or at least turn them into
direct CALL instructions)?

Right. That's the main downside.
If the bpf_for_each*() helper is simple enough the verifier can inline it
similar to map_gen_lookup. In such case the indirect call will be a direct call,
so the overhead won't be that bad, but it's still a function call and
static function will have full prologue+epilogue.
Converting static function into direct jump would be really challenging
for the verifier and won't provide much benefit, since r6-r9 save/restore
would need to happen anyway even for such 'inlined' static func, since
llvm will be freely using r6-r9 for insns inside function body
assuming that it's a normal function.

May be there is a way to avoid call overhead with with clang extensions.
If we want to do:
int mem_eq(char *p1, char *p2, int size)
{
   int i;
   for (i = 0; i < size; i++)
     if (p1[i] != p2[i])
       return 0;
   return 1;
}

With clang extension we might write it as:
int bpf_mem_eq(char *p1, char *p2, int size)
{
   int i = 0;
   int iter;

   iter = __builtin_for_until(i, size, ({
       if (p1[i] != p2[i])
         goto out;
   }));
   out:
   if (iter != size)
     return 0;
   return 1;
}

The llvm will generate pseudo insns for this __builtin_for.
The verifier will analyze the loop body for the range [0, size)
and replace pseudo insns with normal branches after the verification.
We might even keep the normal C syntax for loops and use
llvm HardwareLoops pass to add pseudo insns.
It's more or less the same ideas for loops we discussed before
bounded loops were introduced.
The main problem with bounded loops is that the loop body will
typically be verified the number of times equal to the number of iterations.
So for any non-trivial loop such iteration count is not much more
than 100. The verifier can do scalar evolution analysis, but
it's likely won't work for many cases and user experience
will suffer. With __builtin_for the scalar evolution is not necessary,
since induction variable is one and explicit and its range is explicit too.
That enables single pass over loop body.
One might argue that for (i = 0; i < 10000; i += 10) loops are
necessary too, but instead of complicating the verifier with sparse
ranges it's better to put that on users that can do:
   iter = __builtin_for_until(i, 10000 / 10, ({
       j = i * 10;
       use j;
   }));
Single explicit induction variable makes the verification practical.
The loop body won't be as heavily optimized as normal loop,
but it's a good thing.

We have discussed how to verify *well-formed* loops back in 2018.
(BPF control flow, supporting loops and other patterns:
 https://www.linuxplumbersconf.org/event/2/contributions/116/)
Now probably the time to revisit it again!

I think Alexei's proposal is the right direction to have
compiler preserving the loop structure with some pseudo instructions
and verifier is able to range-based verification
instead of iterating all loop iterations.

LLVM already has some IR loop intrinsics like below:
  def int_set_loop_iterations :
  def int_start_loop_iterations :
  def int_test_set_loop_iterations :
  def int_test_start_loop_iterations :
  def int_loop_decrement :
  def int_loop_decrement_reg :
to facilitate hardware loop. BPF target can define its own
intrinsics to help define a well structured loop.

I will look into this.