Re: BPF logging infrastructure. Was: [PATCH bpf-next 4/6] tools: add new members to bpf_attr.raw_tracepoint in bpf.h

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Tue, 14 Jul 2020 14:58:02 -0700

On Tue, Jul 14, 2020 at 1:47 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>
> > On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
> >>
> >> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
> >> >>
> >> >> From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> >> >>
> >> >> Sync addition of new members from main kernel tree.
> >> >>
> >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx>
> >> >> ---
> >> >>  tools/include/uapi/linux/bpf.h |    9 +++++++--
> >> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >> >> index da9bf35a26f8..662a15e4a1a1 100644
> >> >> --- a/tools/include/uapi/linux/bpf.h
> >> >> +++ b/tools/include/uapi/linux/bpf.h
> >> >> @@ -573,8 +573,13 @@ union bpf_attr {
> >> >>         } query;
> >> >>
> >> >>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> >> >> -               __u64 name;
> >> >> -               __u32 prog_fd;
> >> >> +               __u64           name;
> >> >> +               __u32           prog_fd;
> >> >> +               __u32           log_level;      /* verbosity level of log */
> >> >> +               __u32           log_size;       /* size of user buffer */
> >> >> +               __aligned_u64   log_buf;        /* user supplied buffer */
> >> >> +               __u32           tgt_prog_fd;
> >> >> +               __u32           tgt_btf_id;
> >> >>         } raw_tracepoint;
> >> >>
> >> >>         struct { /* anonymous struct for BPF_BTF_LOAD */
> >> >>
> >> >
> >> > I think BPF syscall would benefit from common/generalized log support
> >> > across all commands, given how powerful/complex it already is.
> >> > Sometimes it's literally impossible to understand why one gets -EINVAL
> >> > without adding printk()s in the kernel.
> >>
> >> Yes, I agree! This is horrible UI!
> >
> > UI?.. It's a perfectly fine and extensible API for all functionality
> > it provides, it just needs a better human-readable feedback mechanism,
> > which is what I'm proposing. Error codes are not working when you have
> > so many different situations that can result in error.
>
> Yes. I was agreeing with you: the lack of understandable error messages
> means you have to play "guess where this EINVAL came from", which is a
> terrible user experience (should have been UX instead of UI, I guess,
> sorry about that).

Ah, ok, the term confused me :)

>
> >> > But it feels wrong to add log_level/log_size/log_buf fields to every
> >> > section of bpf_attr and require the user to specify it differently for
> >> > each command. So before we go and start adding per-command fields,
> >> > let's discuss how we can do this more generically. I wonder if we can
> >> > come up with a good way to do it in one common way and then gradually
> >> > support that way throughout all BPF commands.
> >> >
> >> > Unfortunately it's too late to just add a few common fields to
> >> > bpf_attr in front of all other per-command fields, but here's two more
> >> > ideas just to start discussion. I hope someone can come up with
> >> > something nicer.
> >> >
> >> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size).
> >> > We can extend it with two more optional arguments: one for pointer to
> >> > log-defining attr (for log_buf pointer, log_size, log_level, maybe
> >> > something more later) and another for the size of that log attr.
> >> > Beyond that we'd need some way to specify that the user actually meant
> >> > to provide those 2 optional args. The most straightforward way would
> >> > be to use the highest bit of cmd argument. So instead of calling bpf()
> >> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE |
> >> > BPF_LOGGED, where BPF_LOGGED = 1<<31.
> >>
> >> Well, if only we'd had a 'flags' argument to the syscall... I don't
> >> suppose we want a bpf2()? :)
> >>
> >> I like your idea of just using the top bits of the 'cmd' field as flag
> >> bits, but in that case we should just define this explicitly, say
> >> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'?
> >
> > sure
> >
> >>
> >> And instead of adding new arguments, why not just change the meaning of
> >> the 'attr' argument? Say we define:
> >>
> >> struct bpf_extended_attr {
> >>        __u32 log_level;
> >>        __u32 log_size;
> >>        __aligned_u64 log_buf;
> >>        __u8 reserved[48];
> >>        union bpf_attr attr;
> >> };
> >
> > because this is a major PITA for libraries, plus it's not very
> > extensible, once you run out of 48 bytes? And when you don't need
> > those 48 bytes, you still need to zero them out, the kernel still
> > needs to copy them, etc. It just feels unclean to me.
> >
> > But before we argue that, is there a problem adding 2 more arguments
> > which are never used/read unless we have an extra bit set in cmd?
> > Honest question, are there any implications to user-space with such an
> > approach? Backwards-compatibility issues or anything?
>
> No idea; I don't know enough about how the lower-level details of the
> syscall interface works to tell either way. Is it even *possible* to add
> arguments like that in a backwards-compatible way?
>
> However, assuming it *is* possible, my larger point was that we
> shouldn't add just a 'logging struct', but rather a 'common options
> struct' which can be extended further as needed. And if it is *not*
> possible to add new arguments to a syscall like you're proposing, my
> suggestion above would be a different way to achieve basically the same
> (at the cost of having to specify the maximum reserved space in advance).
>

yeah-yeah, I agree, it's less a "logging attr", more of "common attr
across all commands".

> >> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the
> >> kernel to interpret the second argument of the syscall as a pointer to
> >> that struct instead of to the bpf_attr union?
> >>
> >> > 2. A more "stateful" approach, would be to have an extra BPF command
> >> > to set log buffer (and level) per thread. And if such a log is set, it
> >> > would be overwritten with content on each bpf() syscall invocation
> >> > (i.e., log position would be reset to 0 on each BPF syscall).
> >>
> >> I don't think adding something stateful is a good idea; that's bound to
> >> lead to weird issues when someone messes up the state management in
> >> userspace.
> >
> > I agree, I'd prefer a stateless approach, but wanted to lay out a
> > stateful one for completeness as well.
>
> OK, cool.
>
> >>
> >> > Of course, the existing BPF_LOAD_PROG command would keep working with
> >> > existing log, but could fall back to the "common one", if
> >> > BPF_LOAD_PROG-specific one is not set.
> >> >
> >> > It also doesn't seem to be all that critical to signal when the log
> >> > buffer is overflown. It's pretty easy to detect from user-space:
> >> > - either last byte would be non-zero, if we don't care about
> >> > guaranteeing zero-termination for truncated log;
> >> > - of second-to-last byte would be non-zero, if BPF syscall will always
> >> > zero-terminate the log.
> >>
> >> I think if we're doing this we should think about making the contents of
> >> the log machine-readable, so applications can figure out what's going on
> >> without having to parse the text strings. The simplest would be to make
> >> this new log buffer use TLV-style messaging, say we define the log
> >> buffer output to be a series of messages like these:
> >>
> >> struct bpf_log_msg {
> >>        __u16 type;
> >>        __u32 len;
> >>        __u8 contents[]; /* of size 'len' */
> >> } __attribute__((packed));
> >>
> >> To begin with we could define two types:
> >>
> >> struct bpf_log_msg_string {
> >>        __u16 type; /* BPF_LOG_MSG_TYPE_STRING */
> >>        __u32 len;
> >>        char message[];
> >> }  __attribute__((packed));
> >>
> >> struct bpf_log_msg_end {
> >>        __u16 type; /* BPF_LOG_MSG_TYPE_END */
> >>        __u32 len;
> >>        __u16 num_truncations;
> >> }  __attribute__((packed));
> >>
> >> The TYPE_STRING would just be a wrapper for the existing text-based
> >> messages, but delimited so userspace can pick them apart. And the second
> >> one would solve your 'has the log been truncated' issue above; the
> >> kernel simply always reserves eight bytes at the end of the buffer and
> >> ends with writing a TYPE_END message with the number of messages that
> >> were dropped due to lack of space. That would make it trivial for
> >> userspace to detect truncation.
> >>
> >
> > Log truncation is not an issue, we can make bpf syscall to write back
> > the actual size of emitted log (and optionally extra bool to mean
> > "truncated") into the original log_size field.
>
> So what was all that you were talking about with "check the
> second-to-last byte" in your previous email? I understood that to be
> about detecting truncation?

Yeah, ignore my initial rambling. One can do that (detecting
truncationg) without any extra "feedback" from bpf syscall, but I
think returning filled length is probably a better approach and
doesn't hamper any other aspects.

>
> >> We could then add new message types later for machine-consumption. Say,
> >> and extended error code, or offsets into the BTF information, or
> >> whatever we end up needing. But just wrapping the existing log messages
> >> in TLVs like the ones above could be fairly straight-forwardly
> >> implemented with the existing bpf_log() infrastructure in the kernel,
> >> without having to settle on which machine-readable information is useful
> >> ahead of time... And the TLV format makes it easy for userspace to just
> >> skip message types it doesn't understand.
> >>
> >> WDYT?
> >
> > I think it's taking it a bit too far and adds more API on top of
> > existing API. Now all those types become a fixed API, messages become
> > an API, etc. Just more backwards compatibility stuff we need to
> > support forever, for, what I believe, very little gain.
> >
> > In practice, using human-readable strings works just fine. If there is
> > any kind of real issue, usually it involves humans reading debug logs
> > and understanding what's going on.
>
> Let me give an example of what I want to be able to do. Just today I
> helped someone debug getting xdp-filter to run, and they were getting
> output like this:
>
>       libbpf: -- BEGIN DUMP LOG ---
>       libbpf:
>       xdpfilt_alw_all() is not a global function
>       processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
>       peak_states 0 mark_read 0
>
>       libbpf: -- END LOG --
>
> I would like to have xdp-filter catch that, and turn it into a
> friendlier "your compiler is too old" message. Having to effectively
> grep through the free-form log output to pick out that message feels
> brittle and error prone, as opposed to just having the kernel add a
> machine-readable id ("err_func_linkage_not_global" or somesuch) and
> stick it in a machine-parsable TLV.

Yeah, I guessed as much, but I still think it's unnecessarily
restrictive (for BPF kernel developers) to adhere to such a rigorous
error format. We'll have a few hundred different failure reasons, and
with the verifier constantly evolving some of them might get
deprecated, others will be constantly added, some other might change
the meaning, etc. I don't see this realistically working long term. So
at that point, free-form text-delimited text is as good as we should
(and, realistically, would) probably go. Library, if it wants to be
extra-helpful, can still reasonably easily parse line-delimited text,
even if that's not an ideal setup (especially if error messages will
get periodically changed).

But I'd be happy to hear what others think on this matter as well, of course.

>
> > Also adopting these packet-like messages is not as straightforward
> > through BPF code, as now you can't just construct a single log line
> > with few calls to bpf_log().
>
> Why not? bpf_log() could just transparently write the four bytes of
> header (TYPE_STRING, followed by strlen(msg)) into the buffer before the
> string? And in the future, an enhanced version could take (say) an error
> ID as another parameter and transparently add that as a separate message.

I mean when you construct one error message with few printf-like
functions. We do have that in libbpf, but I haven't checked the
verifier code. Basically, assuming one bpf_log() call is a complete
"message" might not be true.

>
> >> > Of course, if user code cares about such detection of log truncation,
> >> > it will need to set last/second-to-last bytes to zero before each
> >> > syscall.
> >> >
> >> > Both approaches have their pros/cons, we can dig into those later, but
> >> > I'd like to start this discussion and see what other people think. I
> >> > also wonder if there are other syscalls with similarly advanced input
> >> > arguments (like bpf_attr) and common logging, we could learn from
> >> > those.
> >>
> >> The first one that comes to mind is netlink extacks. Of course it's not
> >> quite comparable since netlink already has message-based semantics, but
> >> it does do sorta-kinda the same thing from a user PoV. The TLV format is
> >> obviously inspired by netlink (or, well, binary networking protocols in
> >> general).
> >>
> >
> > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink
> > format, extack messages themselves are just strings. But my question
> > was more of how this log could be done for complicated API calls using
> > similar extendable attrs, like perf_event_open, clone3, openat2, etc.
>
> Ah, right, no idea :)
>
> -Toke
>