Re: [PATCH bpf-next v2 2/9] riscv, bpf: add support for far branching

Palmer Dabbelt <palmerdabbelt@xxxxxxxxxx> · Wed, 22 Jan 2020 18:08:57 -0800 (PST)

On Tue, 07 Jan 2020 00:13:56 PST (-0800), Bjorn Topel wrote:
Back from the holidays; Sorry about the delayed reply.

On Mon, 23 Dec 2019 at 19:03, Palmer Dabbelt <palmerdabbelt@xxxxxxxxxx> wrote:

On Mon, 16 Dec 2019 01:13:36 PST (-0800), Bjorn Topel wrote:
> This commit adds branch relaxation to the BPF JIT, and with that
[...]
> @@ -1557,6 +1569,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  {
>       bool tmp_blinded = false, extra_pass = false;
>       struct bpf_prog *tmp, *orig_prog = prog;
> +     int pass = 0, prev_ninsns = 0, i;
>       struct rv_jit_data *jit_data;
>       struct rv_jit_context *ctx;
>       unsigned int image_size;
> @@ -1596,15 +1609,25 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>               prog = orig_prog;
>               goto out_offset;
>       }
> +     for (i = 0; i < prog->len; i++) {
> +             prev_ninsns += 32;
> +             ctx->offset[i] = prev_ninsns;
> +     }

It feels like the first-order implementation is the same as binutils here: the
first round is worst cased, after which things can be more exact.  We're only
doing one pass in binutils because most of the relaxation happens in the
linker, but this approach seems reasonable to me.  I'd be interested in seeing
some benchmarks, as it may be worth relaxing these in the binutils linker as
well -- I can certainly come up with contrived test cases that aren't relaxed,
but I'm not sure how common this is.

Ah, interesting! Let me try to pull out some branch relaxation
statistics/benchmarks for the BPF selftests.

My only worry is that that invariant should be more explicit.  Specifically,
I'm thinking that every time offset is updated there should be some sort of
assertion that the offset is shrinking.  This is enforced structurally in the
binutils code because we only generate code once and then move it around, but
since you're generating code every time it'd be easy for a bug to sneak in as
the JIT gets more complicated.

Hmm, yes. Maybe use a checksum for the program in addition to the
length invariant, and converge condition would then be prev_len == len
&& prev_crc == crc?

That would work, but it breaks my unfinished optimization below.  I was
thinking something more like "every time offset[i] is updated, check that it
gets smaller and otherwise barf".

Since most of the branches should be forward, you'll probably end up with way
fewer iterations if you do the optimization passes backwards.

Good idea!

> -     /* First pass generates the ctx->offset, but does not emit an image. */
> -     if (build_body(ctx, extra_pass)) {
> -             prog = orig_prog;
> -             goto out_offset;
> +     for (i = 0; i < 16; i++) {
> +             pass++;
> +             ctx->ninsns = 0;
> +             if (build_body(ctx, extra_pass)) {
> +                     prog = orig_prog;
> +                     goto out_offset;

Isn't this returning a broken program if build_body() errors out the first time
through?

Hmm, care to elaborate? I don't see how?

Ya, I don't either any more.  Hopefully I just got confused between prog and
ctx...

> +             }
> +             build_prologue(ctx);
> +             ctx->epilogue_offset = ctx->ninsns;
> +             build_epilogue(ctx);
> +             if (ctx->ninsns == prev_ninsns)
> +                     break;
> +             prev_ninsns = ctx->ninsns;

IDK how important the performance of the JIT is, but you could probably get
away with skipping an iteration by keeping track of some simple metric that
determines if it would be possible to

...to? Given that the programs are getting larger, performance of the
JIT is important. So, any means the number of passes can be reduced is
a good thing!

I guess I meant to say "determines if it would be possible to make any
modifications next time".  I was thinking something along the lines of:

* as you run through the program, keep track of the shortest branch distance
* if you didn't remove enough bytes to make that branch cross a relaxation
 boundary, then you know that next time you won't be able to do any useful
 work

You're already computing all the branch lengths, so it's just an extra min().
Since we're assuming a small number of passes (after reversing the relaxation
direction), you'll probably save more work avoiding the extra pass than it'll
take to compute the extra information.  I guess some sort of benchmark would
give a real answer, but it certainly smells like a good idea ;)

Thanks for the review/suggestions!
Björn