Re: [PATCH bpf-next v3 04/11] bpf: add bpf_for_each_map_elem() helper

Yonghong Song <yhs@xxxxxx> · Thu, 25 Feb 2021 19:22:00 -0800






On 2/25/21 6:16 PM, Yonghong Song wrote:


On 2/25/21 2:41 PM, Andrii Nakryiko wrote:
On Thu, Feb 25, 2021 at 1:35 AM Yonghong Song <yhs@xxxxxx> wrote:

The bpf_for_each_map_elem() helper is introduced which
iterates all map elements with a callback function. The
helper signature looks like
   long bpf_for_each_map_elem(map, callback_fn, callback_ctx, flags)
and for each map element, the callback_fn will be called. For example,
like hashmap, the callback signature may look like
   long callback_fn(map, key, val, callback_ctx)

There are two known use cases for this. One is from upstream ([1]) where
a for_each_map_elem helper may help implement a timeout mechanism
in a more generic way. Another is from our internal discussion
for a firewall use case where a map contains all the rules. The packet
data can be compared to all these rules to decide allow or deny
the packet.

For array maps, users can already use a bounded loop to traverse
elements. Using this helper can avoid using bounded loop. For other
type of maps (e.g., hash maps) where bounded loop is hard or
impossible to use, this helper provides a convenient way to
operate on all elements.

For callback_fn, besides map and map element, a callback_ctx,
allocated on caller stack, is also passed to the callback
function. This callback_ctx argument can provide additional
input and allow to write to caller stack for output.

If the callback_fn returns 0, the helper will iterate through next
element if available. If the callback_fn returns 1, the helper
will stop iterating and returns to the bpf program. Other return
values are not used for now.

Currently, this helper is only available with jit. It is possible
to make it work with interpreter with so effort but I leave it
as the future work.

[1]: 
https://lore.kernel.org/bpf/20210122205415.113822-1-xiyou.wangcong@xxxxxxxxx/ 


Signed-off-by: Yonghong Song <yhs@xxxxxx>
---

It looks good from the perspective of implementation (though I
admittedly lost track of all the insn[0|1].imm transformations). But
see some suggestions below (I hope you can incorporate them).

Overall, though:

Acked-by: Andrii Nakryiko <andrii@xxxxxxxxxx>

  include/linux/bpf.h            |  13 +++
  include/linux/bpf_verifier.h   |   3 +
  include/uapi/linux/bpf.h       |  29 ++++-
  kernel/bpf/bpf_iter.c          |  16 +++
  kernel/bpf/helpers.c           |   2 +
  kernel/bpf/verifier.c          | 208 ++++++++++++++++++++++++++++++---
  kernel/trace/bpf_trace.c       |   2 +
  tools/include/uapi/linux/bpf.h |  29 ++++-
  8 files changed, 287 insertions(+), 15 deletions(-)

[...]

  static int prepare_func_exit(struct bpf_verifier_env *env, int 
*insn_idx)
  {
         struct bpf_verifier_state *state = env->cur_state;
@@ -5400,8 +5487,22 @@ static int prepare_func_exit(struct 
bpf_verifier_env *env, int *insn_idx)

         state->curframe--;
         caller = state->frame[state->curframe];
-       /* return to the caller whatever r0 had in the callee */
-       caller->regs[BPF_REG_0] = *r0;
+       if (callee->in_callback_fn) {
+               /* enforce R0 return value range [0, 1]. */
+               struct tnum range = tnum_range(0, 1);
+
+               if (r0->type != SCALAR_VALUE) {
+                       verbose(env, "R0 not a scalar value\n");
+                       return -EACCES;
+               }
+               if (!tnum_in(range, r0->var_off)) {

if (!tnum_in(tnum_range(0, 1), r0->var_off)) should work as well,
unless you find it less readable (I don't but no strong feeling here)

Will give a try.

Will keep it as is since range is used below for error reporting.




+                       verbose_invalid_scalar(env, r0, &range, 
"callback return", "R0");
+                       return -EINVAL;
+               }
+       } else {
+               /* return to the caller whatever r0 had in the callee */
+               caller->regs[BPF_REG_0] = *r0;
+       }

         /* Transfer references to the caller */
         err = transfer_reference_state(caller, callee);
[...]