On Wed, May 6, 2020 at 2:42 PM Yonghong Song <yhs@xxxxxx> wrote: > > > > On 5/6/20 10:37 AM, Andrii Nakryiko wrote: > > On Sun, May 3, 2020 at 11:26 PM Yonghong Song <yhs@xxxxxx> wrote: > >> > >> Two helpers bpf_seq_printf and bpf_seq_write, are added for > >> writing data to the seq_file buffer. > >> > >> bpf_seq_printf supports common format string flag/width/type > >> fields so at least I can get identical results for > >> netlink and ipv6_route targets. > >> > > > > Does seq_printf() has its own format string specification? Is there > > any documentation explaining? I was confused by few different checks > > below... > > Not really. Similar to bpf_trace_printk(), since we need to > parse format string, so we may only support a subset of > what seq_printf() does. But we should not invent new > formats. > > > > >> For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW > >> specifically indicates a write failure due to overflow, which > >> means the object will be repeated in the next bpf invocation > >> if object collection stays the same. Note that if the object > >> collection is changed, depending how collection traversal is > >> done, even if the object still in the collection, it may not > >> be visited. > >> > >> bpf_seq_printf may return -EBUSY meaning that internal percpu > >> buffer for memory copy of strings or other pointees is > >> not available. Bpf program can return 1 to indicate it > >> wants the same object to be repeated. Right now, this should not > >> happen on no-RT kernels since migrate_enable(), which guards > >> bpf prog call, calls preempt_enable(). > > > > You probably meant migrate_disable()/preempt_disable(), right? But > > Yes, sorry for typo. > > > could it still happen, at least due to NMI? E.g., perf_event BPF > > program gets triggered during bpf_iter program execution? I think for > > perf_event_output function, we have 3 levels, for one of each possible > > "contexts"? Should we do something like that here as well? > > Currently bpf_seq_printf() and bpf_seq_write() helpers can > only be called by iter bpf programs. The iter bpf program can only > be run on process context as it is triggered by a read() syscall. > So one level should be enough for non-RT kernel. > > For RT kernel, migrate_disable does not prevent preemption, > so it is possible task in the middle of bpf_seq_printf() might > be preempted, so I implemented the logic to return -EBUSY. > I think this case should be extremely rare so I only implemented > one level nesting. yeah, makes sense > > > > >> > >> Signed-off-by: Yonghong Song <yhs@xxxxxx> > >> --- > >> include/uapi/linux/bpf.h | 32 +++++- > >> kernel/trace/bpf_trace.c | 195 +++++++++++++++++++++++++++++++++ > >> scripts/bpf_helpers_doc.py | 2 + > >> tools/include/uapi/linux/bpf.h | 32 +++++- > >> 4 files changed, 259 insertions(+), 2 deletions(-) > >> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 97ceb0f2e539..e440a9d5cca2 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -3076,6 +3076,34 @@ union bpf_attr { > >> * See: clock_gettime(CLOCK_BOOTTIME) > >> * Return > >> * Current *ktime*. > >> + * > > > > [...] > > > >> +BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size, > >> + const void *, data, u32, data_len) > >> +{ > >> + int err = -EINVAL, fmt_cnt = 0, memcpy_cnt = 0; > >> + int i, buf_used, copy_size, num_args; > >> + u64 params[MAX_SEQ_PRINTF_VARARGS]; > >> + struct bpf_seq_printf_buf *bufs; > >> + const u64 *args = data; > >> + > >> + buf_used = this_cpu_inc_return(bpf_seq_printf_buf_used); > >> + if (WARN_ON_ONCE(buf_used > 1)) { > >> + err = -EBUSY; > >> + goto out; > >> + } > >> + > >> + bufs = this_cpu_ptr(&bpf_seq_printf_buf); > >> + > >> + /* > >> + * bpf_check()->check_func_arg()->check_stack_boundary() > >> + * guarantees that fmt points to bpf program stack, > >> + * fmt_size bytes of it were initialized and fmt_size > 0 > >> + */ > >> + if (fmt[--fmt_size] != 0) > > > > If we allow fmt_size == 0, this will need to be changed. > > Currently, we do not support fmt_size == 0. Yes, if we allow, this > needs change. > > > > >> + goto out; > >> + > >> + if (data_len & 7) > >> + goto out; > >> + > >> + for (i = 0; i < fmt_size; i++) { > >> + if (fmt[i] == '%' && (!data || !data_len)) > > > > So %% escaping is not supported? > > Yes, have not seen a need yet my ipv6_route/netlink example. > Can certain add if there is a use case. I can imagine this being used quite often when trying to print out percentages... Would just suck to have to upgrade kernel just to be able to print % character :) > > > > >> + goto out; > >> + } > >> + > >> + num_args = data_len / 8; > >> + > >> + /* check format string for allowed specifiers */ > >> + for (i = 0; i < fmt_size; i++) { > >> + if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i])) > > > > why these restrictions? are they essential? > > This is the same restriction in bpf_trace_printk(). I guess the purpose > is to avoid weird print. To promote bpf_iter to dump beyond asscii, I > guess we can remove this restriction. well, if underlying seq_printf() will fail for those "more liberal" strings, then that would be bad. Basically, we should try to not impose additional restrictions compared to seq_printf, but also no need to allow more, which will be rejected by it. I haven't checked seq_printf implementation though, so don't know what are those restrictions. > > > > >> + goto out; > >> + > >> + if (fmt[i] != '%') > >> + continue; > >> + > >> + if (fmt_cnt >= MAX_SEQ_PRINTF_VARARGS) { > >> + err = -E2BIG; > >> + goto out; > >> + } > >> + > >> + if (fmt_cnt >= num_args) > >> + goto out; > >> + > >> + /* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */ > >> + i++; > >> + > >> + /* skip optional "[0+-][num]" width formating field */ > >> + while (fmt[i] == '0' || fmt[i] == '+' || fmt[i] == '-') > > > > There could be space as well, as an alternative to 0. > > We can allow space. But '0' is used more common, right? I use both space and 0 quite often, space especially with aligned strings. > > > > >> + i++; > >> + if (fmt[i] >= '1' && fmt[i] <= '9') { > >> + i++; > >> + while (fmt[i] >= '0' && fmt[i] <= '9') > >> + i++; > >> + } > >> + > >> + if (fmt[i] == 's') { > >> + /* disallow any further format extensions */ > >> + if (fmt[i + 1] != 0 && > >> + !isspace(fmt[i + 1]) && > >> + !ispunct(fmt[i + 1])) > >> + goto out; > > > > I'm not sure I follow this check either. printf("%sbla", "whatever") > > is a perfectly fine format string. Unless seq_printf has some > > additional restrictions? > > Yes, just some restriction inherited from bpf_trace_printk(). > Will remove. see comment above, if we allow it here, but seq_printf() will reject it, then there is no point > > > > >> + > >> + /* try our best to copy */ > >> + if (memcpy_cnt >= MAX_SEQ_PRINTF_MAX_MEMCPY) { > >> + err = -E2BIG; > >> + goto out; > >> + } > >> + > > > > [...] > > > >> + > >> +static int bpf_seq_printf_btf_ids[5]; > >> +static const struct bpf_func_proto bpf_seq_printf_proto = { > >> + .func = bpf_seq_printf, > >> + .gpl_only = true, > >> + .ret_type = RET_INTEGER, > >> + .arg1_type = ARG_PTR_TO_BTF_ID, > >> + .arg2_type = ARG_PTR_TO_MEM, > >> + .arg3_type = ARG_CONST_SIZE, > > > > It feels like allowing zero shouldn't hurt too much? > > This is the format string, I would prefer to keep it non-zero. yeah, makes sense, I suppose > > > > >> + .arg4_type = ARG_PTR_TO_MEM_OR_NULL, > >> + .arg5_type = ARG_CONST_SIZE_OR_ZERO, > >> + .btf_id = bpf_seq_printf_btf_ids, > >> +}; > >> + > >> +BPF_CALL_3(bpf_seq_write, struct seq_file *, m, const void *, data, u32, len) > >> +{ > >> + return seq_write(m, data, len) ? -EOVERFLOW : 0; > >> +} > >> + > >> +static int bpf_seq_write_btf_ids[5]; > >> +static const struct bpf_func_proto bpf_seq_write_proto = { > >> + .func = bpf_seq_write, > >> + .gpl_only = true, > >> + .ret_type = RET_INTEGER, > >> + .arg1_type = ARG_PTR_TO_BTF_ID, > >> + .arg2_type = ARG_PTR_TO_MEM, > >> + .arg3_type = ARG_CONST_SIZE, > > > > Same, ARG_CONST_SIZE_OR_ZERO? > > This one, possible. Let me check. I just remember how much trouble perf_event_output() was causing me because it enforced >0 for data length. Even though my variable-sized output was always >0, proving that to (especially older) verifier was extremely hard. So the less unnecessary restrictions - the better. > > > > >> + .btf_id = bpf_seq_write_btf_ids, > >> +}; > >> + > > > > [...] > >