Re: BPF logging infrastructure. Was: [PATCH bpf-next 4/6] tools: add new members to bpf_attr.raw_tracepoint in bpf.h

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Mon, 20 Jul 2020 15:30:55 -0700

On Fri, Jul 17, 2020 at 08:54:45PM -0700, Andrii Nakryiko wrote:
> 
> > Only libbpf can do it. Kernel is helpless here.
> > Say we change the kernel errno for all unsuported prog types and maps
> > it would return ENOTSUPP or something.
> > Would it really help the situation?
> 
> IMO, if the kernel just prints out "Unknown BPF command 123" or
> "Unknown map type 345" that would be already a nice improvement.
...
> log_buf can't help existing kernels. Period. No one is arguing or
> expecting that. But moving forward, just having that "unknown command
> 123" would be great.
> 
> But yeah, of course libbpf can create a probing map and try to do
> BATCH_LOOKUP, to detect BATCH_LOOKUP support.

Also with BTF the kernel is self documented.
The following will print all commands that kernel supports:
bpftool btf dump file ./bld_x64/vmlinux |grep -A40 bpf_cmd
and with 'grep BPF_MAP_TYPE_' all supported maps.
For older kernels there is 'bpftool feature probe'.
Since libbpf reads vmlinux BTF anyway it could have got a knowledge
of all the features it supports based on BTF.

But let's continue this thought experiment for augmenting error
reporting with a string.
For 'Unknown BPF command 123' to work log_buf needs to passed
outside of 'union bpf_attr'. Probably as 4th argument to sys_bpf ?
and then a bit in 'int cmd' would need to be burned to indicate
that 4th arg is there. Probably size of the arg needs to be passed
either as 5th arg or as part of the 'struct bpf_log_buf' so it can
be extensible.
Using that log_buf directly in
SYSCALL_DEFINE[45](bpf, int cmd, ... struct bpf_log_buf *log_buf)
will be trivial.
But to pass it into any of first level functions (bpf_iter_create,
bpf_prog_attach, etc) they would need to gain an extra argument.
To pass it all the way into hierarchy_allows_attach() it needs to be added to:
__cgroup_bpf_attach 
cgroup_bpf_attach
cgroup_bpf_link_attach
link_create

Another case of ambiguous 'return -EINVAL' would cause another change
to a bunch of function prototypes.
So it's better to integrate it into current task_struct to avoid huge code churn.

But if we do so what stops us from generalizing log_buf reporting to
other syscalls? perf_event_open is in the same category.

> This one is for perf subsystem, actually, it's its
> PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event
> attachment).

clearly not only sys_bpf and sys_perf_event_open, but sys_ioctl would need
string support to be a full answer to ambiguous einval-s.

> My proposal was about adding the ability to emit something to log_buf
> from any of the BPF commands, if that BPF command chooses to provide
> extra error information. The whole point of this was to avoid adding
> log_buf in command-specific ways (as Toke was doing in the patch that
> I used to initiate the discussion) and do it once for entire syscall,
> so that we can gradually utilize it where it makes most sense.

I don't think that works due to code churn. Whether we pay that price
once or 'gradually' it doesn't make it any better.
When log_buf is added to an existing command the 'union bpf_attr' is there
in the function proto and nothing new needs to passed to a lot of functions.
So I certainly prefer Toke's approach of adding log_buf to one specific
command if it's really needed there.

The alternative is to solve it for all syscalls.

> I agree that if such diagnostics are reliable and the situation itself
> is common and experienced by multiple users, then it might make sense
> to add such checks to libbpf. 

'experienced by multiple users' is going to be a judgement call
either for libbpf or for kernel.
I'm saying let's improve libbpf user-friendlyness everywhere we can.
We can always drop these hints later. Unlike kernel messages that
might become stable api.
One thing is log_buf that the verifier is dumping. It's huge and not parsable.
Whereas a string next to return EINVAL may become an uapi.
I wouldn't be surprised if some of netlink extack strings got to this
level of stability and changing the string may cause somebody to notice
and the commit would get reverted.

> But I also don't think it's always
> possible to diagnose something automatically with 100% confidence. We
> can give hints, but misdiagnosing the problem can just further confuse
> things instead.

I think the possible confusion is fine.
The libbpf has an opportunity to try them and remove if things don't work out.
The kernel strings would be much more scrutinized and harder to change.

> Also quite often such problems are one-offs, which
> doesn't make them any less confusing and frustrating, but custom
> diagnostics for every such case has a potential of bloating libbpf
> beyond recognition

bloating libbpf with strings is imo better way to figure out how to
improve user experience than bloating kernel.

> of them. That's what I meant that this is not a scalable approach to
> just say "fix libbpf to be more user friendly, kernel does its best
> already".

Ok. I will take it back. The kernel can be improved with extra strings
here and there, but only if it's done without huge code churn.
If current task_struct approach can work and the changes would be
limited to something like:

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index ac53102e244a..244df18728c2 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -196,8 +196,12 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp,
                        return true;
                cnt = prog_list_length(&p->bpf.progs[type]);
                WARN_ON_ONCE(cnt > 1);
-               if (cnt == 1)
-                       return !!(flags & BPF_F_ALLOW_OVERRIDE);
+               if (cnt == 1) {
+                       ret = !!(flags & BPF_F_ALLOW_OVERRIDE);
+                       if (!ret)
+                               syscall_string("cgroup has non-overridable program in the parent");
+                       return ret;
+               }

Then such syscall_string reporting mechanism would be solid addition to the
kernel. Otherwise the cost of passing explicit log_buf everywhere is not worth
it.

I think such syscall_string can probably piggy back on "socket local storage
into generic local storage" work. Sooner or later we will have
per task_struct storage. If that's a single pointer in task_struct that will
be used by both task local storage from inside tracing bpf program and
by this syscall_string() reporting mechanism then we can land it.
May be all syscalls will become stateful. sys_bpf, sys_perf_event_open, sys_ioctl
followed by new syscall "give me back the error string".
Or may be we can do some asm magic and pass both errno and a string from
a syscall at the same time.