Re: [PATCH v2 bpf-next 4/6] bpf: stop setting precise in current state

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 24 Jan 2024 11:06:50 +0100 Stefan Fleischmann <sfle@xxxxxx>
wrote:
> On Fri, 4 Nov 2022 09:36:47 -0700
> Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> 
> > Setting reg->precise to true in current state is not necessary from
> > correctness standpoint, but it does pessimise the whole precision
> > (or rather "imprecision", because that's what we want to keep as
> > much as possible) tracking. Why is somewhat subtle and my best
> > attempt to explain this is recorded in an extensive comment for
> > __mark_chain_precise() function. Some more careful thinking and code
> > reading is probably required still to grok this completely,
> > unfortunately. Whiteboarding and a bunch of extra handwaiving in
> > person would be even more helpful, but is deemed impractical in Git
> > commit.  
> 
> Not sure if this is the preferred way to bring this up, if not please
> direct me elsewhere. I've noticed problems with this patch in the 5.15
> kernel line, originally the Ubuntu kernel. Bug report with more
> information can be found here:
>  https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/2050098
> 
> > Next patch pushes this imprecision property even further, building
> > on top of the insights described in this patch.  
> 
> I tracked this down to the changes in upstream version from 5.15.126
> to 127, and have confirmed that reverting this patch and the next
> mentioned here solves our problem.
> 
> > End results are pretty nice, we get reduction in number of total
> > instructions and states verified due to a better states reuse, as
> > some of the states are now more generic and permissive due to less
> > unnecessary precise=true requirements.  
> 
> I'll describe the problem here briefly in case you don't have time to
> read through the Ubuntu bug report. We use Slurm on a cluster and use
> cgroup2 for resource confinement. Now with this change in 5.15 we get
> this error from the slurm daemon on the node:
> 
>  slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
>  device). Please check your system limits (MEMLOCK).
>  (debug log available here:
>  https://launchpadlibrarian.net/710598602/slurmd_cgroup_log.txt)
> 
> And more importantly the cgroup confinement is not working anymore.
> As I said reverting this patch brings back functionality. Now it would
> be easy to blame Slurm here, but I've tested newer kernels, 6.5,
> 6.6.13, 6.7.1 which all work fine. Which makes me believe some crucial
> parts might not have been backported to 5.15.

Sorry for the noise, turns out this is a bug in Slurm that is only
triggered by long bpf logs. So I suppose the newer kernels produce less
logs, hence the issue is not triggered there.

Best regards,
Stefan


> Best regards,
> Stefan
> 
> > SELFTESTS RESULTS
> > =================
> > 
> > $ ./veristat -C -e file,prog,insns,states
> > ~/subprog-precise-results.csv ~/imprecise-early-results.csv | grep
> > -v '+0' File                                     Program
> > Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states
> > (A)  Total states (B)  Total states (DIFF)
> > ---------------------------------------  ----------------------
> > ---------------  ---------------  ------------------
> > ----------------  ----------------  -------------------
> > bpf_iter_ksym.bpf.linked1.o              dump_ksym
> >        347              285       -62 (-17.87%)                20
> >            19          -1 (-5.00%) pyperf600_bpf_loop.bpf.linked1.o
> >       on_event                           3678             3736
> > +58 (+1.58%)               276               285          +9
> > (+3.26%) setget_sockopt.bpf.linked1.o             skops_sockopt
> >       4038             3947        -91 (-2.25%)               347
> >           343          -4 (-1.15%) test_l4lb.bpf.linked1.o
> >       balancer_ingress                   4559             2611
> > -1948 (-42.73%)               118               105        -13
> > (-11.02%) test_l4lb_noinline.bpf.linked1.o         balancer_ingress
> >                 6279             6268        -11 (-0.18%)
> >   237               236          -1 (-0.42%)
> > test_misc_tcp_hdr_options.bpf.linked1.o  misc_estab
> >       1307             1303         -4 (-0.31%)               100
> >            99          -1 (-1.00%) test_sk_lookup.bpf.linked1.o
> >       ctx_narrow_access                   456              447
> >  -9 (-1.97%)                39                38          -1
> > (-2.56%) test_sysctl_loop1.bpf.linked1.o          sysctl_tcp_mem
> >       1389             1384         -5 (-0.36%)                26
> >            25          -1 (-3.85%) test_tc_dtime.bpf.linked1.o
> >       egress_fwdns_prio101                518              485
> > -33 (-6.37%)                51                46          -5
> > (-9.80%) test_tc_dtime.bpf.linked1.o              egress_host
> >        519              468        -51 (-9.83%)                50
> >            44         -6 (-12.00%) test_tc_dtime.bpf.linked1.o
> >       ingress_fwdns_prio101               842             1000
> > +158 (+18.76%)                73                88        +15
> > (+20.55%) xdp_synproxy_kern.bpf.linked1.o          syncookie_tc
> >               405757           373173     -32584 (-8.03%)
> > 25735             22882      -2853 (-11.09%)
> > xdp_synproxy_kern.bpf.linked1.o          syncookie_xdp
> >     479055           371590   -107465 (-22.43%)             29145
> >         22207      -6938 (-23.81%)
> > ---------------------------------------  ----------------------
> > ---------------  ---------------  ------------------
> > ----------------  ----------------  -------------------
> > 
> > Slight regression in
> > test_tc_dtime.bpf.linked1.o/ingress_fwdns_prio101 is left for a
> > follow up, there might be some more precision-related bugs in
> > existing BPF verifier logic.
> > 
> > CILIUM RESULTS
> > ==============
> > 
> > $ ./veristat -C -e file,prog,insns,states
> > ~/subprog-precise-results-cilium.csv
> > ~/imprecise-early-results-cilium.csv | grep -v '+0' File
> > Program                         Total insns (A)  Total insns (B)
> > Total insns (DIFF)  Total states (A)  Total states (B)  Total states
> > (DIFF) -------------  ------------------------------
> > --------------- ---------------  ------------------
> > ---------------- ----------------  ------------------- bpf_host.o
> >   cil_from_host 762              556      -206 (-27.03%)
> >              43                37         -6 (-13.95%) bpf_host.o
> > tail_handle_nat_fwd_ipv4                  23541            23426
> >  -115 (-0.49%)              1538              1537          -1
> > (-0.07%) bpf_host.o     tail_nodeport_nat_egress_ipv4
> > 33592            33566        -26 (-0.08%)              2163
> >     2161          -2 (-0.09%) bpf_lxc.o
> > tail_handle_nat_fwd_ipv4 23541            23426       -115 (-0.49%)
> >    1538              1537          -1 (-0.07%) bpf_overlay.o
> > tail_nodeport_nat_egress_ipv4             33581            33543
> >   -38 (-0.11%)              2160              2157          -3
> > (-0.14%) bpf_xdp.o      tail_handle_nat_fwd_ipv4
> > 21659            20920       -739 (-3.41%)              1440
> >     1376         -64 (-4.44%) bpf_xdp.o
> > tail_handle_nat_fwd_ipv6 17084            17039        -45 (-0.26%)
> >     907               905          -2 (-0.22%) bpf_xdp.o
> > tail_lb_ipv4                              73442            73430
> >   -12 (-0.02%)              4370              4369          -1
> > (-0.02%) bpf_xdp.o      tail_lb_ipv6
> > 152114           151895       -219 (-0.14%)              6493
> >      6479         -14 (-0.22%) bpf_xdp.o
> > tail_nodeport_nat_egress_ipv4             17377            17200
> >  -177 (-1.02%)              1125              1111         -14
> > (-1.24%) bpf_xdp.o      tail_nodeport_nat_ingress_ipv6
> > 6405             6397         -8 (-0.12%)               309
> >     308          -1 (-0.32%) bpf_xdp.o      tail_rev_nodeport_lb4
> >                  7126             6934       -192 (-2.69%)
> >    414               402         -12 (-2.90%) bpf_xdp.o
> > tail_rev_nodeport_lb6                     18059            17905
> >  -154 (-0.85%)              1105              1096          -9
> > (-0.81%) -------------  ------------------------------
> > ---------------  ---------------  ------------------
> > ----------------  ----------------  -------------------
> > 
> > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > ---
> >  kernel/bpf/verifier.c | 103
> > +++++++++++++++++++++++++++++++++++++----- 1 file changed, 91
> > insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index c1169ee1bc7c..ff3fc21ce99b 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -2749,8 +2749,11 @@ static void mark_all_scalars_precise(struct
> > bpf_verifier_env *env, 
> >  	/* big hammer: mark all scalars precise in this path.
> >  	 * pop_stack may still get !precise scalars.
> > +	 * We also skip current state and go straight to first
> > parent state,
> > +	 * because precision markings in current non-checkpointed
> > state are
> > +	 * not needed. See why in the comment in
> > __mark_chain_precision below. */
> > -	for (; st; st = st->parent)
> > +	for (st = st->parent; st; st = st->parent) {
> >  		for (i = 0; i <= st->curframe; i++) {
> >  			func = st->frame[i];
> >  			for (j = 0; j < BPF_REG_FP; j++) {
> > @@ -2768,8 +2771,88 @@ static void mark_all_scalars_precise(struct
> > bpf_verifier_env *env, reg->precise = true;
> >  			}
> >  		}
> > +	}
> >  }
> >  
> > +/*
> > + * __mark_chain_precision() backtracks BPF program instruction
> > sequence and
> > + * chain of verifier states making sure that register *regno* (if
> > regno >= 0)
> > + * and/or stack slot *spi* (if spi >= 0) are marked as precisely
> > tracked
> > + * SCALARS, as well as any other registers and slots that
> > contribute to
> > + * a tracked state of given registers/stack slots, depending on
> > specific BPF
> > + * assembly instructions (see backtrack_insns() for exact
> > instruction handling
> > + * logic). This backtracking relies on recorded jmp_history and is
> > able to
> > + * traverse entire chain of parent states. This process ends only
> > when all the
> > + * necessary registers/slots and their transitive dependencies are
> > marked as
> > + * precise.
> > + *
> > + * One important and subtle aspect is that precise marks *do not
> > matter* in
> > + * the currently verified state (current state). It is important to
> > understand
> > + * why this is the case.
> > + *
> > + * First, note that current state is the state that is not yet
> > "checkpointed",
> > + * i.e., it is not yet put into env->explored_states, and it has no
> > children
> > + * states as well. It's ephemeral, and can end up either a) being
> > discarded if
> > + * compatible explored state is found at some point or BPF_EXIT
> > instruction is
> > + * reached or b) checkpointed and put into env->explored_states,
> > branching out
> > + * into one or more children states.
> > + *
> > + * In the former case, precise markings in current state are
> > completely
> > + * ignored by state comparison code (see regsafe() for details).
> > Only
> > + * checkpointed ("old") state precise markings are important, and
> > if old
> > + * state's register/slot is precise, regsafe() assumes current
> > state's
> > + * register/slot as precise and checks value ranges exactly and
> > precisely. If
> > + * states turn out to be compatible, current state's necessary
> > precise
> > + * markings and any required parent states' precise markings are
> > enforced
> > + * after the fact with propagate_precision() logic, after the fact.
> > But it's
> > + * important to realize that in this case, even after marking
> > current state
> > + * registers/slots as precise, we immediately discard current
> > state. So what
> > + * actually matters is any of the precise markings propagated into
> > current
> > + * state's parent states, which are always checkpointed (due to b)
> > case above).
> > + * As such, for scenario a) it doesn't matter if current state has
> > precise
> > + * markings set or not.
> > + *
> > + * Now, for the scenario b), checkpointing and forking into
> > child(ren)
> > + * state(s). Note that before current state gets to checkpointing
> > step, any
> > + * processed instruction always assumes precise SCALAR
> > register/slot
> > + * knowledge: if precise value or range is useful to prune jump
> > branch, BPF
> > + * verifier takes this opportunity enthusiastically. Similarly,
> > when
> > + * register's value is used to calculate offset or memory address,
> > exact
> > + * knowledge of SCALAR range is assumed, checked, and enforced. So,
> > similar to
> > + * what we mentioned above about state comparison ignoring precise
> > markings
> > + * during state comparison, BPF verifier ignores and also assumes
> > precise
> > + * markings *at will* during instruction verification process. But
> > as verifier
> > + * assumes precision, it also propagates any precision dependencies
> > across
> > + * parent states, which are not yet finalized, so can be further
> > restricted
> > + * based on new knowledge gained from restrictions enforced by
> > their children
> > + * states. This is so that once those parent states are finalized,
> > i.e., when
> > + * they have no more active children state, state comparison logic
> > in
> > + * is_state_visited() would enforce strict and precise SCALAR
> > ranges, if
> > + * required for correctness.
> > + *
> > + * To build a bit more intuition, note also that once a state is
> > checkpointed,
> > + * the path we took to get to that state is not important. This is
> > crucial
> > + * property for state pruning. When state is checkpointed and
> > finalized at
> > + * some instruction index, it can be correctly and safely used to
> > "short
> > + * circuit" any *compatible* state that reaches exactly the same
> > instruction
> > + * index. I.e., if we jumped to that instruction from a completely
> > different
> > + * code path than original finalized state was derived from, it
> > doesn't
> > + * matter, current state can be discarded because from that
> > instruction
> > + * forward having a compatible state will ensure we will safely
> > reach the
> > + * exit. States describe preconditions for further exploration, but
> > completely
> > + * forget the history of how we got here.
> > + *
> > + * This also means that even if we needed precise SCALAR range to
> > get to
> > + * finalized state, but from that point forward *that same* SCALAR
> > register is
> > + * never used in a precise context (i.e., it's precise value is not
> > needed for
> > + * correctness), it's correct and safe to mark such register as
> > "imprecise"
> > + * (i.e., precise marking set to false). This is what we rely on
> > when we do
> > + * not set precise marking in current state. If no child state
> > requires
> > + * precision for any given SCALAR register, it's safe to dictate
> > that it can
> > + * be imprecise. If any child state does require this register to
> > be precise,
> > + * we'll mark it precise later retroactively during precise
> > markings
> > + * propagation from child state to parent states.
> > + */
> >  static int __mark_chain_precision(struct bpf_verifier_env *env, int
> > frame, int regno, int spi)
> >  {
> > @@ -2787,6 +2870,10 @@ static int __mark_chain_precision(struct
> > bpf_verifier_env *env, int frame, int r if (!env->bpf_capable)
> >  		return 0;
> >  
> > +	/* Do sanity checks against current state of register
> > and/or stack
> > +	 * slot, but don't set precise flag in current state, as
> > precision
> > +	 * tracking in the current state is unnecessary.
> > +	 */
> >  	func = st->frame[frame];
> >  	if (regno >= 0) {
> >  		reg = &func->regs[regno];
> > @@ -2794,11 +2881,7 @@ static int __mark_chain_precision(struct
> > bpf_verifier_env *env, int frame, int r WARN_ONCE(1, "backtracing
> > misuse"); return -EFAULT;
> >  		}
> > -		if (!reg->precise)
> > -			new_marks = true;
> > -		else
> > -			reg_mask = 0;
> > -		reg->precise = true;
> > +		new_marks = true;
> >  	}
> >  
> >  	while (spi >= 0) {
> > @@ -2811,11 +2894,7 @@ static int __mark_chain_precision(struct
> > bpf_verifier_env *env, int frame, int r stack_mask = 0;
> >  			break;
> >  		}
> > -		if (!reg->precise)
> > -			new_marks = true;
> > -		else
> > -			stack_mask = 0;
> > -		reg->precise = true;
> > +		new_marks = true;
> >  		break;
> >  	}
> >  
> > @@ -11534,7 +11613,7 @@ static bool regsafe(struct bpf_verifier_env
> > *env, struct bpf_reg_state *rold, if (env->explore_alu_limits)
> >  			return false;
> >  		if (rcur->type == SCALAR_VALUE) {
> > -			if (!rold->precise && !rcur->precise)
> > +			if (!rold->precise)
> >  				return true;
> >  			/* new val must satisfy old val knowledge
> > */ return range_within(rold, rcur) &&  
> 




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux