Re: [PATCH bpf-next 3/3] selftests/bpf: add tests for user- and non-CO-RE BPF_CORE_READ() variants

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Tue, 5 Jan 2021 22:09:20 -0800

On Tue, Jan 05, 2021 at 01:03:47PM -0800, Andrii Nakryiko wrote:
> > >
> > >   - handling 32-bit vs 64-bit UAPI structs uniformly;
> >
> > what do you mean?
> > 32-bit user space running on 64-bit kernel works through 'compat' syscalls.
> > If bpf progs are accessing 64-bit uapi structs in such case they're broken
> > and no amount of co-re can help.
> 
> I know nothing about compat, so can't comment on that. But the way I
> understood the situation was the BPF program compiled once (well, at
> least from the unmodified source code), but runs on ARM32 and (on a
> separate physical host) on ARM64. And it's task is, say, to read UAPI
> kernel structures from syscall arguments.

I'm not sure about arm, but on x86 there should be two different progs
for this to work (if we're talking about 32-bit userspace).
See below.

> One simple example I found in UAPI definitions is struct timespec, it
> seems it's defined with `long`:
> 
> struct timespec {
>         __kernel_old_time_t     tv_sec;         /* seconds */
>         long                    tv_nsec;        /* nanoseconds */
> }
> 
> So if you were to trace clock_gettime(), you'd need to deal with
> differently-sized reads of tv_nsec, depending on whether you are
> running on the 32-bit or 64-bit host.

I believe gettime is vdso, so that's not a syscall and there are no bpf hooks
available to track that.
For normal syscall with timespec argument like sys_recvmmsg and sys_nanosleep
the user needs to load two programs and attach to two different points:
fentry/__x64_sys_nanosleep and fentry/__ia32_sys_nanosleep.
(I'm not an expert on compat and not quite sure how _time32 suffix is handled,
so may be 4 different progs are needed).
The point is that there are several different entry points with different args.
ia32 user space will call into the kernel differently.
At the end different pieces of the syscall and compat handling will call
into hrtimer_nanosleep with ktime.
So if one bpf prog needs to work for 32 and 64 bit user space it should be
attached to common piece of code like hrtimer_nanosleep().
If it's attached to syscall entry it needs to attach to at least 2 different
entry points with 2 different progs.
I guess it's possible to load single prog and kprobe-attach it to
two different kernel functions, but code is kinda different.
For the simplest things like sys_nanosleep it might work and timespec
is simple enough structure to handle with sizeof(long) co-re tricks,
but it's not a generally applicable approach.
So for a tool to track 32-bit and 64-bit user space is quite tricky.

If bpf prog doesn't care about user space and only needs to deal
with 64-bit kernel + 64-bit user space and 32-bit kernel + 32-bit user space
then it's a bit simpler, but probably still requires attaching
to two different points. On 32-bit x86 kernel there will be no
"fentry/__x64_sys_nanosleep" function.

> There are probably other examples where UAPI structs use long instead
> of __u32 or __u64, but I didn't dig too deep.

There is also crazy difference between compact_ioctl vs ioctl.

The point I'm trying to get across is that the clean 32-bit processing is
a lot more involved than just CORE_READ_USER macro and I want to make sure that
the expections are set correctly.
BPF CO-RE prog may handle 32-bit and 64-bit at the same time, but
it's probably exception than the rule.

> Let's see if Gilad can provide his perspective. I have no strong
> feelings about this and can send a patch removing CORE_READ_USER
> variants (they didn't make it into libbpf v0.3, so no harm or API
> stability concerns). BPF_PROBE_READ() and BPF_PROBE_READ_USER() are
> still useful for reading non-relocatable, but nested data structures,
> so I'd prefer to keep those.

Let's hear from Gilad.
I'm not against BPF_CORE_READ_USER macro as-is. I was objecting
to the reasons of implementing it.