On Wed, Jun 22, 2022 at 02:30:21AM +0300, Eduard Zingerman wrote:
> Hi Everyone,
>
> Below I describe a scenario in which the function
> `verifier.c:do_misc_fixups` exhibits O(n**2) performance for certain
> BPF programs. I also describe a possible adjustment to
> `verifier.c:do_misc_fixups` that avoids such behavior. I can work on
> this adjustment in my spare time for a few weeks if the community is
> interested.
>
> The function `verifier.c:do_misc_fixups` uses
> `verifier.c:bpf_patch_insn_data` to replace single instructions with a
> series of instructions. `verifier.c:bpf_patch_insn_data` is a wrapper
> for the function `core.c:bpf_patch_insn_single`. The latter function
> operates in the following manner:
> - allocates a new instruction array of size `old size + patch size`;
> - copies the old instructions;
> - copies the patch instructions;
> - adjusts the offset fields of all instructions.
>
> This algorithm means that for pathological BPF programs
> `verifier.c:do_misc_fixups` shows running time proportional to the
> square of the program length. An example of such a program looks as
> follows:
>
>   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);
>   ... N times ...
>   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_jiffies64);
>   BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 0);
>   BPF_EXIT_INSN();
>
> `verifier.c:do_misc_fixups` replaces each call to jiffies64 with 3
> instructions. Since every replacement copies the whole program, which
> grows by 2 instructions per patch, the code above leads to the copying
> of N + (N+2) + (N+4) + ... + (N+2N) instructions in total, which is
> O(n**2).
>
> Experimental results confirm this observation. Here is the output of
> the demo program:
>
>   prog len     verif time, usec
>        128                 1461
>     256 x2             4132 x2.8
>     512 x2            12510 x3.0
>    1024 x2            44022 x3.5
>    2048 x2           170479 x3.9
>    4096 x2           646766 x3.8
>    8192 x2          2557379 x4.0
>
> The source code for this program is at the end of this email.
>
> The following technique could be used to improve the running time of
> `verifier.c:do_misc_fixups`:
> - patches are not applied immediately but are accumulated;
> - patches are stored in an intermediate array; the size of this array
>   is doubled when the capacity for a new patch is insufficient;
> - all patches are applied at once at the end of
>   `verifier.c:do_misc_fixups`;
> - an intermediate data structure is constructed to efficiently map
>   between old and new instruction indexes;
> - instruction offset fields are updated using this data structure.
>
> In terms of the C code, this could look as follows:
>
> /* Describes a single patch:
>  * the BPF instruction at index `offset` is replaced by
>  * the series of instructions pointed to by `insns`.
>  */
> struct bpf_patch {
> 	int offset;      /* index inside the original program,
> 			  * the instruction at this index would be replaced
> 			  */
> 	int insns_count; /* size of the patch */
> 	int delta;       /* difference in indexes between original program
> 			  * and new program after application of all patches
> 			  * up to and including this one
> 			  */
> 	struct bpf_insn *insns; /* patch instructions */
> };
>
> /* Used to accumulate patches */
> struct bpf_patching_state {
> 	struct bpf_patch *patches; /* array of all patches */
> 	int patches_count;         /* current number of patches */
> 	int patches_capacity;      /* total capacity of the patches array */
> 	struct bpf_insn *insns;    /* array of patch instructions;
> 				    * bpf_patch::insns points to the middle
> 				    * of this array
> 				    */
> 	int insns_count;           /* first free index in the insns array */
> 	int insns_capacity;        /* total capacity of the insns array */
> };
>
> /* Allocates the `patches` and `insns` arrays with some initial capacity */
> static int init_patching_state(struct bpf_patching_state *state)
> { ... }
>
> /* Adds a patch record to the `bpf_patching_state::patches` array;
>  * if the array capacity is insufficient, its size is doubled.
>  * Copies `insns` to the end of `bpf_patching_state::insns`;
>  * if the array capacity is insufficient, its size is doubled.
>  *
>  * It is assumed that patches are added in increasing order of
>  * the `bpf_patch::offset` field.
>  */
> static int add_patch(struct bpf_patching_state *state,
> 		     int offset,
> 		     int insns_count,
> 		     struct bpf_insn *insns)
> { ... }
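>
> For illustration, a minimal untested sketch of `add_patch` could look
> like the code below. It assumes `init_patching_state` allocated both
> arrays with a non-zero initial capacity. Note that growing the `insns`
> array has to rebase the `bpf_patch::insns` pointers recorded so far,
> because they point into the reallocated buffer:
>
> static int add_patch(struct bpf_patching_state *state,
> 		     int offset,
> 		     int insns_count,
> 		     struct bpf_insn *insns)
> {
> 	/* Double the patch descriptor array when it is full. */
> 	if (state->patches_count == state->patches_capacity) {
> 		struct bpf_patch *p;
>
> 		p = krealloc_array(state->patches,
> 				   state->patches_capacity * 2,
> 				   sizeof(*p), GFP_KERNEL);
> 		if (!p)
> 			return -ENOMEM;
> 		state->patches = p;
> 		state->patches_capacity *= 2;
> 	}
>
> 	/* Double the instruction buffer until the new patch fits. */
> 	while (state->insns_count + insns_count > state->insns_capacity) {
> 		struct bpf_insn *tmp;
> 		int pos = 0;
>
> 		tmp = krealloc_array(state->insns,
> 				     state->insns_capacity * 2,
> 				     sizeof(*tmp), GFP_KERNEL);
> 		if (!tmp)
> 			return -ENOMEM;
> 		/* The buffer may have moved; patch instructions are packed
> 		 * back to back, so the pointers can be recomputed from the
> 		 * prefix sums of insns_count.
> 		 */
> 		for (int i = 0; i < state->patches_count; i++) {
> 			state->patches[i].insns = tmp + pos;
> 			pos += state->patches[i].insns_count;
> 		}
> 		state->insns = tmp;
> 		state->insns_capacity *= 2;
> 	}
>
> 	memcpy(state->insns + state->insns_count, insns,
> 	       insns_count * sizeof(*insns));
> 	state->patches[state->patches_count++] = (struct bpf_patch) {
> 		.offset = offset,
> 		.insns_count = insns_count,
> 		.insns = state->insns + state->insns_count,
> 	};
> 	state->insns_count += insns_count;
> 	return 0;
> }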
>
> /* - fills in the `bpf_patch::delta` fields for all patches in `state`;
>  * - allocates the new program;
>  * - copies program and patch instructions using the
>  *   `patch_and_copy_insn` function.
>  */
> static struct bpf_insn *apply_patches(struct bpf_patching_state *state,
> 				      struct bpf_insn *prog,
> 				      int *prog_size)
> {
> 	int delta = 0;
>
> 	for (int i = 0; i < state->patches_count; ++i) {
> 		delta += state->patches[i].insns_count - 1;
> 		state->patches[i].delta = delta;
> 	}
> 	...
> }
>
> /* Uses binary search to find the `bpf_patch::delta` corresponding to
>  * `index`. `index` stands for the index of an instruction in the
>  * original program. Looks for the closest
>  * `state->patches[?]->offset <= index` and returns its `delta`.
>  */
> static int lookup_delta_for_index(struct bpf_patching_state *state, int index)
> { ... }
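>
> A possible untested sketch of this binary search is shown below; it
> relies on the patches being added in increasing `offset` order and
> returns 0 when no patch precedes `index`:
>
> static int lookup_delta_for_index(struct bpf_patching_state *state, int index)
> {
> 	int lo = 0, hi = state->patches_count - 1, best = -1;
>
> 	/* Find the rightmost patch with offset <= index. */
> 	while (lo <= hi) {
> 		int mid = lo + (hi - lo) / 2;
>
> 		if (state->patches[mid].offset <= index) {
> 			best = mid;
> 			lo = mid + 1;
> 		} else {
> 			hi = mid - 1;
> 		}
> 	}
>
> 	return best >= 0 ? state->patches[best].delta : 0;
> }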
>
> /* Copies the instruction at `src` to the instruction at `dst` and
>  * adjusts the `dst->off` field. `pc` stands for the instruction index
>  * of `src` in the original program.
>  */
> static void patch_and_copy_insn(struct bpf_patching_state *state,
> 				int pc,
> 				struct bpf_insn *dst,
> 				struct bpf_insn *src)
> {
> 	int new_pc = pc + lookup_delta_for_index(state, pc);
> 	int new_dst = pc + src->off +
> 		      lookup_delta_for_index(state, pc + src->off);
>
> 	memcpy(dst, src, sizeof(struct bpf_insn));
> 	dst->off = new_dst - new_pc;
> }
>
> Please let me know what you think about this change and whether or not
> it should be implemented.

tbh sounds scary. We had so many bugs in the patch_insn code over the
years. The code is subtle.

The main overhead is from bpf_adj_branches(), as you noticed. We might
even do it twice for programs larger than S16_MAX, resulting in O(n^3)
behavior. There are also bpf_adj_linfo(), adjust_insn_aux_data(),
adjust_subprog_starts(), adjust_poke_descs(). All of them run every
time we patch. It's certainly something worth optimizing if we can.

Before proceeding too far based on an artificial test, please collect
the number of patches and their sizes that we currently do across all
progs in selftests. Maybe it's not that bad in real code.

As far as the algo goes, patch_and_copy_insn() assumes that the 'dst'
insn is a branch? I guess I'm missing some parts of the proposed algo.

Instead of delaying everything, maybe we can do all patching inline
except bpf_adj_branches? Remember only:

	int offset;      /* index inside the original program,
			  * the instruction at this index would be replaced
			  */
	int insns_count; /* size of the patch */
	int delta;       /* difference in indexes between original program
			  * and new program after application of all patches
			  * up to and including this one
			  */

and do a single bpf_adj_branches() at the end?
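
For concreteness, a hypothetical sketch of what that single pass could
compute, reusing the mapping from patch_and_copy_insn() above (with
delta(i) being the cumulative delta of the last remembered patch whose
offset is <= i, as in lookup_delta_for_index()):

	/* One bpf_adj_branches-style pass over the patched program:
	 * a branch that sat at old index pc with offset off becomes
	 *
	 *	new_off = off + delta(pc + off) - delta(pc);
	 *
	 * so every branch is touched once instead of once per patch.
	 */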