Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes: > On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes: >> >> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote: >> >> >> >> From: Toke Høiland-Jørgensen <toke@xxxxxxxxxx> >> >> >> >> Sync addition of new members from main kernel tree. >> >> >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@xxxxxxxxxx> >> >> --- >> >> tools/include/uapi/linux/bpf.h | 9 +++++++-- >> >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> >> index da9bf35a26f8..662a15e4a1a1 100644 >> >> --- a/tools/include/uapi/linux/bpf.h >> >> +++ b/tools/include/uapi/linux/bpf.h >> >> @@ -573,8 +573,13 @@ union bpf_attr { >> >> } query; >> >> >> >> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ >> >> - __u64 name; >> >> - __u32 prog_fd; >> >> + __u64 name; >> >> + __u32 prog_fd; >> >> + __u32 log_level; /* verbosity level of log */ >> >> + __u32 log_size; /* size of user buffer */ >> >> + __aligned_u64 log_buf; /* user supplied buffer */ >> >> + __u32 tgt_prog_fd; >> >> + __u32 tgt_btf_id; >> >> } raw_tracepoint; >> >> >> >> struct { /* anonymous struct for BPF_BTF_LOAD */ >> >> >> > >> > I think BPF syscall would benefit from common/generalized log support >> > across all commands, given how powerful/complex it already is. >> > Sometimes it's literally impossible to understand why one gets -EINVAL >> > without adding printk()s in the kernel. >> >> Yes, I agree! This is horrible UI! > > UI?.. It's a perfectly fine and extensible API for all functionality > it provides, it just needs a better human-readable feedback mechanism, > which is what I'm proposing. Error codes are not working when you have > so many different situations that can result in error. Yes. I was agreeing with you: the lack of understandable error messages means you have to play "guess where this EINVAL came from", which is a terrible user experience (should have been UX instead of UI, I guess, sorry about that). >> > But it feels wrong to add log_level/log_size/log_buf fields to every >> > section of bpf_attr and require the user to specify it differently for >> > each command. So before we go and start adding per-command fields, >> > let's discuss how we can do this more generically. I wonder if we can >> > come up with a good way to do it in one common way and then gradually >> > support that way throughout all BPF commands. >> > >> > Unfortunately it's too late to just add a few common fields to >> > bpf_attr in front of all other per-command fields, but here's two more >> > ideas just to start discussion. I hope someone can come up with >> > something nicer. >> > >> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). >> > We can extend it with two more optional arguments: one for pointer to >> > log-defining attr (for log_buf pointer, log_size, log_level, maybe >> > something more later) and another for the size of that log attr. >> > Beyond that we'd need some way to specify that the user actually meant >> > to provide those 2 optional args. The most straightforward way would >> > be to use the highest bit of cmd argument. So instead of calling bpf() >> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | >> > BPF_LOGGED, where BPF_LOGGED = 1<<31. >> >> Well, if only we'd had a 'flags' argument to the syscall... I don't >> suppose we want a bpf2()? :) >> >> I like your idea of just using the top bits of the 'cmd' field as flag >> bits, but in that case we should just define this explicitly, say >> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'? > > sure > >> >> And instead of adding new arguments, why not just change the meaning of >> the 'attr' argument? Say we define: >> >> struct bpf_extended_attr { >> __u32 log_level; >> __u32 log_size; >> __aligned_u64 log_buf; >> __u8 reserved[48]; >> union bpf_attr attr; >> }; > > because this is a major PITA for libraries, plus it's not very > extensible, once you run out of 48 bytes? And when you don't need > those 48 bytes, you still need to zero them out, the kernel still > needs to copy them, etc. It just feels unclean to me. > > But before we argue that, is there a problem adding 2 more arguments > which are never used/read unless we have an extra bit set in cmd? > Honest question, are there any implications to user-space with such an > approach? Backwards-compatibility issues or anything? No idea; I don't know enough about how the lower-level details of the syscall interface works to tell either way. Is it even *possible* to add arguments like that in a backwards-compatible way? However, assuming it *is* possible, my larger point was that we shouldn't add just a 'logging struct', but rather a 'common options struct' which can be extended further as needed. And if it is *not* possible to add new arguments to a syscall like you're proposing, my suggestion above would be a different way to achieve basically the same (at the cost of having to specify the maximum reserved space in advance). >> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the >> kernel to interpret the second argument of the syscall as a pointer to >> that struct instead of to the bpf_attr union? >> >> > 2. A more "stateful" approach, would be to have an extra BPF command >> > to set log buffer (and level) per thread. And if such a log is set, it >> > would be overwritten with content on each bpf() syscall invocation >> > (i.e., log position would be reset to 0 on each BPF syscall). >> >> I don't think adding something stateful is a good idea; that's bound to >> lead to weird issues when someone messes up the state management in >> userspace. > > I agree, I'd prefer a stateless approach, but wanted to lay out a > stateful one for completeness as well. OK, cool. >> >> > Of course, the existing BPF_LOAD_PROG command would keep working with >> > existing log, but could fall back to the "common one", if >> > BPF_LOAD_PROG-specific one is not set. >> > >> > It also doesn't seem to be all that critical to signal when the log >> > buffer is overflown. It's pretty easy to detect from user-space: >> > - either last byte would be non-zero, if we don't care about >> > guaranteeing zero-termination for truncated log; >> > - of second-to-last byte would be non-zero, if BPF syscall will always >> > zero-terminate the log. >> >> I think if we're doing this we should think about making the contents of >> the log machine-readable, so applications can figure out what's going on >> without having to parse the text strings. The simplest would be to make >> this new log buffer use TLV-style messaging, say we define the log >> buffer output to be a series of messages like these: >> >> struct bpf_log_msg { >> __u16 type; >> __u32 len; >> __u8 contents[]; /* of size 'len' */ >> } __attribute__((packed)); >> >> To begin with we could define two types: >> >> struct bpf_log_msg_string { >> __u16 type; /* BPF_LOG_MSG_TYPE_STRING */ >> __u32 len; >> char message[]; >> } __attribute__((packed)); >> >> struct bpf_log_msg_end { >> __u16 type; /* BPF_LOG_MSG_TYPE_END */ >> __u32 len; >> __u16 num_truncations; >> } __attribute__((packed)); >> >> The TYPE_STRING would just be a wrapper for the existing text-based >> messages, but delimited so userspace can pick them apart. And the second >> one would solve your 'has the log been truncated' issue above; the >> kernel simply always reserves eight bytes at the end of the buffer and >> ends with writing a TYPE_END message with the number of messages that >> were dropped due to lack of space. That would make it trivial for >> userspace to detect truncation. >> > > Log truncation is not an issue, we can make bpf syscall to write back > the actual size of emitted log (and optionally extra bool to mean > "truncated") into the original log_size field. So what was all that you were talking about with "check the second-to-last byte" in your previous email? I understood that to be about detecting truncation? >> We could then add new message types later for machine-consumption. Say, >> and extended error code, or offsets into the BTF information, or >> whatever we end up needing. But just wrapping the existing log messages >> in TLVs like the ones above could be fairly straight-forwardly >> implemented with the existing bpf_log() infrastructure in the kernel, >> without having to settle on which machine-readable information is useful >> ahead of time... And the TLV format makes it easy for userspace to just >> skip message types it doesn't understand. >> >> WDYT? > > I think it's taking it a bit too far and adds more API on top of > existing API. Now all those types become a fixed API, messages become > an API, etc. Just more backwards compatibility stuff we need to > support forever, for, what I believe, very little gain. > > In practice, using human-readable strings works just fine. If there is > any kind of real issue, usually it involves humans reading debug logs > and understanding what's going on. Let me give an example of what I want to be able to do. Just today I helped someone debug getting xdp-filter to run, and they were getting output like this: libbpf: -- BEGIN DUMP LOG --- libbpf: xdpfilt_alw_all() is not a global function processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 libbpf: -- END LOG -- I would like to have xdp-filter catch that, and turn it into a friendlier "your compiler is too old" message. Having to effectively grep through the free-form log output to pick out that message feels brittle and error prone, as opposed to just having the kernel add a machine-readable id ("err_func_linkage_not_global" or somesuch) and stick it in a machine-parsable TLV. > Also adopting these packet-like messages is not as straightforward > through BPF code, as now you can't just construct a single log line > with few calls to bpf_log(). Why not? bpf_log() could just transparently write the four bytes of header (TYPE_STRING, followed by strlen(msg)) into the buffer before the string? And in the future, an enhanced version could take (say) an error ID as another parameter and transparently add that as a separate message. >> > Of course, if user code cares about such detection of log truncation, >> > it will need to set last/second-to-last bytes to zero before each >> > syscall. >> > >> > Both approaches have their pros/cons, we can dig into those later, but >> > I'd like to start this discussion and see what other people think. I >> > also wonder if there are other syscalls with similarly advanced input >> > arguments (like bpf_attr) and common logging, we could learn from >> > those. >> >> The first one that comes to mind is netlink extacks. Of course it's not >> quite comparable since netlink already has message-based semantics, but >> it does do sorta-kinda the same thing from a user PoV. The TLV format is >> obviously inspired by netlink (or, well, binary networking protocols in >> general). >> > > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink > format, extack messages themselves are just strings. But my question > was more of how this log could be done for complicated API calls using > similar extendable attrs, like perf_event_open, clone3, openat2, etc. Ah, right, no idea :) -Toke