Re: [PATCH RFC bpf-next 2/3] libbpf: add ksyscall/kretsyscall sections support for syscall kprobes

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Thu, 7 Jul 2022 13:50:36 -0700

On Thu, Jul 7, 2022 at 12:36 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Jul 7, 2022 at 12:10 PM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Thu, Jul 7, 2022 at 10:23 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Jul 6, 2022 at 5:41 PM Andrii Nakryiko <andrii@xxxxxxxxxx> wrote:
> > > >
> > > > Add SEC("ksyscall")/SEC("ksyscall/<syscall_name>") and corresponding
> > > > kretsyscall variants (for return kprobes) to allow users to kprobe
> > > > syscall functions in kernel. These special sections allow to ignore
> > > > complexities and differences between kernel versions and host
> > > > architectures when it comes to syscall wrapper and corresponding
> > > > __<arch>_sys_<syscall> vs __se_sys_<syscall> differences, depending on
> > > > CONFIG_ARCH_HAS_SYSCALL_WRAPPER.
> > > >
> > > > Combined with the use of BPF_KSYSCALL() macro, this allows to just
> > > > specify intended syscall name and expected input arguments and leave
> > > > dealing with all the variations to libbpf.
> > > >
> > > > In addition to SEC("ksyscall+") and SEC("kretsyscall+") add
> > > > bpf_program__attach_ksyscall() API which allows to specify syscall name
> > > > at runtime and provide associated BPF cookie value.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > > ---
> > > >  tools/lib/bpf/libbpf.c          | 109 ++++++++++++++++++++++++++++++++
> > > >  tools/lib/bpf/libbpf.h          |  16 +++++
> > > >  tools/lib/bpf/libbpf.map        |   1 +
> > > >  tools/lib/bpf/libbpf_internal.h |   2 +
> > > >  4 files changed, 128 insertions(+)
> > > >
> > > > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > > > index cb49408eb298..4749fb84e33d 100644
> > > > --- a/tools/lib/bpf/libbpf.c
> > > > +++ b/tools/lib/bpf/libbpf.c
> > > > @@ -4654,6 +4654,65 @@ static int probe_kern_btf_enum64(void)
> > > >                                              strs, sizeof(strs)));
> > > >  }
> > > >
> > > > +static const char *arch_specific_syscall_pfx(void)
> > > > +{
> > > > +#if defined(__x86_64__)
> > > > +       return "x64";
> > > > +#elif defined(__i386__)
> > > > +       return "ia32";
> > > > +#elif defined(__s390x__)
> > > > +       return "s390x";
> > > > +#elif defined(__s390__)
> > > > +       return "s390";
> > > > +#elif defined(__arm__)
> > > > +       return "arm";
> > > > +#elif defined(__aarch64__)
> > > > +       return "arm64";
> > > > +#elif defined(__mips__)
> > > > +       return "mips";
> > > > +#elif defined(__riscv)
> > > > +       return "riscv";
> > > > +#else
> > > > +       return NULL;
> > > > +#endif
> > > > +}
> > > > +
> > > > +static int probe_kern_syscall_wrapper(void)
> > > > +{
> > > > +       /* available_filter_functions is a few times smaller than
> > > > +        * /proc/kallsyms and has simpler format, so we use it as a faster way
> > > > +        * to check that __<arch>_sys_bpf symbol exists, which is a sign that
> > > > +        * kernel was built with CONFIG_ARCH_HAS_SYSCALL_WRAPPER and uses
> > > > +        * syscall wrappers
> > > > +        */
> > > > +       static const char *kprobes_file = "/sys/kernel/tracing/available_filter_functions";
> > > > +       char func_name[128], syscall_name[128];
> > > > +       const char *ksys_pfx;
> > > > +       FILE *f;
> > > > +       int cnt;
> > > > +
> > > > +       ksys_pfx = arch_specific_syscall_pfx();
> > > > +       if (!ksys_pfx)
> > > > +               return 0;
> > > > +
> > > > +       f = fopen(kprobes_file, "r");
> > > > +       if (!f)
> > > > +               return 0;
> > > > +
> > > > +       snprintf(syscall_name, sizeof(syscall_name), "__%s_sys_bpf", ksys_pfx);
> > > > +
> > > > +       /* check if bpf() syscall wrapper is listed as possible kprobe */
> > > > +       while ((cnt = fscanf(f, "%127s%*[^\n]\n", func_name)) == 1) {
> > > > +               if (strcmp(func_name, syscall_name) == 0) {
> > > > +                       fclose(f);
> > > > +                       return 1;
> > > > +               }
> > > > +       }
> > >
> > > Maybe we should do the other way around ?
> > > cat /proc/kallsyms |grep sys_bpf
> > >
> > > and figure out the prefix from there?
> > > Then we won't need to do giant
> > > #if defined(__x86_64__)
> > > ...
> > >
> >
> > Unfortunately this won't work well due to compat and 32-bit APIs (and
> > bpf() syscall is particularly bad with also bpf_sys_bpf):
> >
> > $ sudo cat /proc/kallsyms| rg '_sys_bpf$'
> > ffffffff811cb100 t __sys_bpf
> > ffffffff811cd380 T bpf_sys_bpf
> > ffffffff811cd520 T __x64_sys_bpf
> > ffffffff811cd540 T __ia32_sys_bpf
> > ffffffff8256fce0 r __ksymtab_bpf_sys_bpf
> > ffffffff8259b5a2 r __kstrtabns_bpf_sys_bpf
> > ffffffff8259bab9 r __kstrtab_bpf_sys_bpf
> > ffffffff83abc400 t _eil_addr___ia32_sys_bpf
> > ffffffff83abc410 t _eil_addr___x64_sys_bpf
>
> That actually means that the current and proposed approaches
> are both somewhat wrong, since they don't attach to both.
> Meaning all syscalls done by 32-bit userspace will not be seen
> by bpf prog.
>
> Probably libbpf should attach to both:
> __x64_sys_bpf and __ia32_sys_bpf.
>

I actually was intending to attach to "native" syscall, I haven't seen
any application/use case attaching to ia32 variants on 64-bit
architectures. Same for compat, if the kernel has it.

My thinking was that compat and 32-bit syscalls are not a typical use
case, and if user cares about them, they should be capable of figuring
out __x64_sys_xxx and __ia32_sys_xxx and __compact_whatever the name
as well.

But that's why I sent it as an RFC, to discuss stuff like this.

> __ksym and __kstr are easy to filter out, since they are
> standard prefixes.
> No idea what eil_addr is.
>
> > $ sudo cat /proc/kallsyms| rg '_sys_mmap$'
> > ffffffff81024480 T __x64_sys_mmap
> > ffffffff810244c0 T __ia32_sys_mmap
> > ffffffff83abae30 t _eil_addr___ia32_sys_mmap
> > ffffffff83abae40 t _eil_addr___x64_sys_mmap
> >
> > We have similar arch-specific switches in few other places (USDT and
> > lib path detection, for example), so it's not a new precedent (for
> > better or worse).
> >
> >
> > > /proc/kallsyms has world read permissions:
> > > proc_create("kallsyms", 0444, NULL, &kallsyms_proc_ops);
> > > unlike available_filter_functions.
> > >
> > > Also tracefs might be mounted in a different dir than
> > > /sys/kernel/tracing/
> > > like
> > > /sys/kernel/debug/tracing/
> >
> > Yeah, good point, was trying to avoid parsing more expensive kallsyms,
> > but given it's done once, it might not be a big deal.
>
> Soon we'll have an iterator for them so doing in-kernel
> search of sys_bpf would be fast.