Re: Draft 3 of bpf(2) man page for review

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Thu, 23 Jul 2015 14:47:14 +0200

On 07/23/2015 01:23 PM, Michael Kerrisk (man-pages) wrote:
...
Btw, a user obviously can close() the map fds if he
wants to, but ultimatively they're freed when the program unloads.

Okay. (Not sure if you meant that something should be added to the page.)

I think not necessary.

[...]
                The attributes key_size and value_size will be used by the

attribute's?

Nope. But I changed this to "The key_size and value_size attributes will be",
which may read clearer.

Sorry, true, I was a bit confused. :)

[...]
The type __u64 is kernel internal, so if there's no strict reason to use it,
we should just use what's provided by stdint.h.

Agreed. Done. (By the way, what about all the __u32 and __u64 elements in the
bpf_attr union?)

I wouldn't change the bpf_attr from the uapi.

Just the provided example code here, I presume people might copy from here when
they build their own library and in userspace uint64_t seems to be more natural.

[...]
                *  map_update_elem()  replaces  elements  in an non-atomic
                   fashion; for atomic updates, a hash-table map should be
                   used instead.

This point here is most important, i.e. to not have false user expecations.
Maybe it's also worth mentioning that when you have a value_size of sizeof(long),
you can however use __sync_fetch_and_add() atomic builtin from the LLVM backend.

I think I'll leave out that detail for the moment.

Ok, I guess we could revisit/clarify that at a later point in time. I'd add
a TODO comment to the source or the like, as this also is related to the 2nd
below use case (aggregation/accounting), where an array is typically used.

                Among the uses for array maps are the following:

                *  As "global" eBPF variables: an array of 1 element whose
                   key is (index) 0 and where the value is a collection of
                   'global'  variables which eBPF programs can use to keep
                   state between events.

                *  Aggregation of tracing events into a fixed set of buck‐
                   ets.

[...]
         *  license is a license string, which must be GPL  compatible  to
            call helper functions marked gpl_only.

Not strictly. So here, the same rules apply as with kernel modules. I.e. what
the kernel checks for are the following license strings:

static inline int license_is_gpl_compatible(const char *license)
{
	return (strcmp(license, "GPL") == 0
		|| strcmp(license, "GPL v2") == 0
		|| strcmp(license, "GPL and additional rights") == 0
		|| strcmp(license, "Dual BSD/GPL") == 0
		|| strcmp(license, "Dual MIT/GPL") == 0
		|| strcmp(license, "Dual MPL/GPL") == 0);
}

With any of them, the eBPF program is declared GPL compatible. Maybe of interest
for those that want to use dual licensing of some sort.

So, I'm a little unclear here. What text do you suggest for the page?

Maybe we should mention in addition that the same licensing rules apply as
in case with kernel modules, so also dual licenses could be used.

         *  log_buf is a pointer to a caller-allocated buffer in which the
            in-kernel verifier can store the verification log.   This  log
            is  a  multi-line  string  that  can be checked by the program
            author in order to understand how the  verifier  came  to  the
            conclusion  that the BPF program is unsafe.  The format of the
            output can change at any time as the verifier evolves.

         *  log_size size of the buffer pointed to  by  log_bug.   If  the
            size  of  the buffer is not large enough to store all verifier
            messages, -1 is returned and errno is set to ENOSPC.

         *  log_level verbosity level of the verifier.  A  value  of  zero
            means that the verifier will not provide a log.

Note that the log buffer is optional as mentioned here log_level = 0. The
above example code of bpf_prog_load() suggests that it always needs to be
provided.

I once ran indeed into an issue where the program itself was correct, but
it got rejected by the kernel, because my log buffer size was too small, so
in tc, we now have it larger as bpf_log_buf[65536] ...

So, I'm not clear. Do you mean that some piece of text here in the page
should be changed? If so, could elaborate?

I'd maybe only mention in addition that in log_level=0 case, we also must not
provide a log_buf and log_size, otherwise we get EINVAL.

[...]
I had to read this twice. ;) Maybe this needs to be reworded slightly.

It just means that depending on the program type that the author selects,
you might end up with a different subset of helper functions, and a
different program input/context. For example tracing does not have the
exact same helpers as socket filters (it might have some that can be used
by both). Also, the eBPF program input (context) for socket filters is a
network packet, wheras for tracing you operate on a set of registers.

Changed. Now we have:

    eBPF program types
        The eBPF program type (prog_type) determines the subset of a ker‐
        nel helper functions that the program may call.  The program type

s/a//

        also determines dthe program input (context)—the format of struct

s/dthe/the/

        bpf_context (which is the data blob passed into the eBPF  program
        as the first argument).

        For  example, a tracing program does not have the exact same sub‐
        set of helper functions as a socket filter program  (though  they
        may have some helpers in common).  Similarly, the input (context)
        for a tracing program is a set of register values,  while  for  a
        socket filter it is a network packet.

        The  set  of functions available to eBPF programs of a given type
        may increase in the future.

That's fine with me.

[...]
I would also make a note about the JIT compiler here, i.e. that it's disabled
by default, and can be enabled via:

* Normal mode: echo 1 > /proc/sys/net/core/bpf_jit_enable

* Debugging mode: echo 2 > /proc/sys/net/core/bpf_jit_enable
    [opcodes dumped in hex into the kernel log, which can then be disassembled

Here, I assume you mean thet the generated (native) opcodes are dumpeed, right?

Yes.

     with tools/net/bpf_jit_disasm.c from the kernel tree]

When enabled, after a eBPF program gets loaded, it's transparently compiled /
translated inside the kernel into machine opcodes for better performance,
currently on x86_64, arm64 and s390.

According to Documentation/networking/filter.txt the JIT compiler supports
many more architectures:

     The Linux kernel has a built-in BPF JIT compiler for x86_64,
     SPARC, PowerPC, ARM, ARM64, MIPS and s390 and can be enabled
     through CONFIG_BPF_JIT.

Or am I misunderstanding something?

The others only work for cBPF and have not (yet) be converted over to eBPF.

For the three mentioned above, the kernel internally migrates cBPF into eBPF
instructions and then JITs the eBPF result eventually.

I added the following:

        The kernel contains a just-in-time (JIT) compiler that translates
        eBPF  bytecode  into  native machine code for better performance.
        The JIT compiler is disabled by default, but its operation can be
        controlled   by   writing   one   of   the  following  values  to
        /proc/sys/net/core/bpf_jit_enable:

        0  Disable JIT compilation (default).

        1  Normal compilation.

        2  Debugging mode.  The generated opcodes are dumped in hexadeci‐
           mal  into the kernel log.  These opcodes can then be disassem‐
           bled using the program tools/net/bpf_jit_disasm.c provided  in
           the kernel source tree.

SEE ALSO
         seccomp(2), socket(7), tc(8), tc-bpf(8)

         Both classic and extended BPF are explained in the kernel  source
         file Documentation/networking/filter.txt.

Rest looks good for an initial version!

Thanks,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html