Re: [PATCH bpf-next v2 13/20] bpf: add bpf_seq_printf and bpf_seq_write helpers

Yonghong Song <yhs@xxxxxx> · Wed, 6 May 2020 14:42:15 -0700

On 5/6/20 10:37 AM, Andrii Nakryiko wrote:
On Sun, May 3, 2020 at 11:26 PM Yonghong Song <yhs@xxxxxx> wrote:

Two helpers bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.

bpf_seq_printf supports common format string flag/width/type
fields so at least I can get identical results for
netlink and ipv6_route targets.


Does seq_printf() has its own format string specification? Is there
any documentation explaining? I was confused by few different checks
below...

Not really. Similar to bpf_trace_printk(), since we need to
parse format string, so we may only support a subset of
what seq_printf() does. But we should not invent new
formats.


For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW
specifically indicates a write failure due to overflow, which
means the object will be repeated in the next bpf invocation
if object collection stays the same. Note that if the object
collection is changed, depending how collection traversal is
done, even if the object still in the collection, it may not
be visited.

bpf_seq_printf may return -EBUSY meaning that internal percpu
buffer for memory copy of strings or other pointees is
not available. Bpf program can return 1 to indicate it
wants the same object to be repeated. Right now, this should not
happen on no-RT kernels since migrate_enable(), which guards
bpf prog call, calls preempt_enable().

You probably meant migrate_disable()/preempt_disable(), right? But

Yes, sorry for typo.

could it still happen, at least due to NMI? E.g., perf_event BPF
program gets triggered during bpf_iter program execution? I think for
perf_event_output function, we have 3 levels, for one of each possible
"contexts"? Should we do something like that here as well?

Currently bpf_seq_printf() and bpf_seq_write() helpers can
only be called by iter bpf programs. The iter bpf program can only
be run on process context as it is triggered by a read() syscall.
So one level should be enough for non-RT kernel.

For RT kernel, migrate_disable does not prevent preemption,
so it is possible task in the middle of bpf_seq_printf() might
be preempted, so I implemented the logic to return -EBUSY.
I think this case should be extremely rare so I only implemented
one level nesting.



Signed-off-by: Yonghong Song <yhs@xxxxxx>
---
  include/uapi/linux/bpf.h       |  32 +++++-
  kernel/trace/bpf_trace.c       | 195 +++++++++++++++++++++++++++++++++
  scripts/bpf_helpers_doc.py     |   2 +
  tools/include/uapi/linux/bpf.h |  32 +++++-
  4 files changed, 259 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 97ceb0f2e539..e440a9d5cca2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3076,6 +3076,34 @@ union bpf_attr {
   *             See: clock_gettime(CLOCK_BOOTTIME)
   *     Return
   *             Current *ktime*.
+ *

[...]

+BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
+          const void *, data, u32, data_len)
+{
+       int err = -EINVAL, fmt_cnt = 0, memcpy_cnt = 0;
+       int i, buf_used, copy_size, num_args;
+       u64 params[MAX_SEQ_PRINTF_VARARGS];
+       struct bpf_seq_printf_buf *bufs;
+       const u64 *args = data;
+
+       buf_used = this_cpu_inc_return(bpf_seq_printf_buf_used);
+       if (WARN_ON_ONCE(buf_used > 1)) {
+               err = -EBUSY;
+               goto out;
+       }
+
+       bufs = this_cpu_ptr(&bpf_seq_printf_buf);
+
+       /*
+        * bpf_check()->check_func_arg()->check_stack_boundary()
+        * guarantees that fmt points to bpf program stack,
+        * fmt_size bytes of it were initialized and fmt_size > 0
+        */
+       if (fmt[--fmt_size] != 0)

If we allow fmt_size == 0, this will need to be changed.

Currently, we do not support fmt_size == 0. Yes, if we allow, this
needs change.


+               goto out;
+
+       if (data_len & 7)
+               goto out;
+
+       for (i = 0; i < fmt_size; i++) {
+               if (fmt[i] == '%' && (!data || !data_len))

So %% escaping is not supported?

Yes, have not seen a need yet my ipv6_route/netlink example.
Can certain add if there is a use case.


+                       goto out;
+       }
+
+       num_args = data_len / 8;
+
+       /* check format string for allowed specifiers */
+       for (i = 0; i < fmt_size; i++) {
+               if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i]))

why these restrictions? are they essential?

This is the same restriction in bpf_trace_printk(). I guess the purpose
is to avoid weird print. To promote bpf_iter to dump beyond asscii, I 
guess we can remove this restriction.


+                       goto out;
+
+               if (fmt[i] != '%')
+                       continue;
+
+               if (fmt_cnt >= MAX_SEQ_PRINTF_VARARGS) {
+                       err = -E2BIG;
+                       goto out;
+               }
+
+               if (fmt_cnt >= num_args)
+                       goto out;
+
+               /* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+               i++;
+
+               /* skip optional "[0+-][num]" width formating field */
+               while (fmt[i] == '0' || fmt[i] == '+'  || fmt[i] == '-')

There could be space as well, as an alternative to 0.

We can allow space. But '0' is used more common, right?


+                       i++;
+               if (fmt[i] >= '1' && fmt[i] <= '9') {
+                       i++;
+                       while (fmt[i] >= '0' && fmt[i] <= '9')
+                               i++;
+               }
+
+               if (fmt[i] == 's') {
+                       /* disallow any further format extensions */
+                       if (fmt[i + 1] != 0 &&
+                           !isspace(fmt[i + 1]) &&
+                           !ispunct(fmt[i + 1]))
+                               goto out;

I'm not sure I follow this check either. printf("%sbla", "whatever")
is a perfectly fine format string. Unless seq_printf has some
additional restrictions?

Yes, just some restriction inherited from bpf_trace_printk().
Will remove.


+
+                       /* try our best to copy */
+                       if (memcpy_cnt >= MAX_SEQ_PRINTF_MAX_MEMCPY) {
+                               err = -E2BIG;
+                               goto out;
+                       }
+

[...]

+
+static int bpf_seq_printf_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_printf_proto = {
+       .func           = bpf_seq_printf,
+       .gpl_only       = true,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_PTR_TO_BTF_ID,
+       .arg2_type      = ARG_PTR_TO_MEM,
+       .arg3_type      = ARG_CONST_SIZE,

It feels like allowing zero shouldn't hurt too much?

This is the format string, I would prefer to keep it non-zero.


+       .arg4_type      = ARG_PTR_TO_MEM_OR_NULL,
+       .arg5_type      = ARG_CONST_SIZE_OR_ZERO,
+       .btf_id         = bpf_seq_printf_btf_ids,
+};
+
+BPF_CALL_3(bpf_seq_write, struct seq_file *, m, const void *, data, u32, len)
+{
+       return seq_write(m, data, len) ? -EOVERFLOW : 0;
+}
+
+static int bpf_seq_write_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_write_proto = {
+       .func           = bpf_seq_write,
+       .gpl_only       = true,
+       .ret_type       = RET_INTEGER,
+       .arg1_type      = ARG_PTR_TO_BTF_ID,
+       .arg2_type      = ARG_PTR_TO_MEM,
+       .arg3_type      = ARG_CONST_SIZE,

Same, ARG_CONST_SIZE_OR_ZERO?

This one, possible. Let me check.


+       .btf_id         = bpf_seq_write_btf_ids,
+};
+

[...]