Re: [PATCH bpf-next 3/3] selftests/bpf: add tests for user- and non-CO-RE BPF_CORE_READ() variants

Gilad Reti <gilad.reti@xxxxxxxxx> · Wed, 6 Jan 2021 12:10:22 +0200

On Wed, Jan 6, 2021 at 8:09 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Jan 05, 2021 at 01:03:47PM -0800, Andrii Nakryiko wrote:
> > > >
> > > >   - handling 32-bit vs 64-bit UAPI structs uniformly;
> > >
> > > what do you mean?
> > > 32-bit user space running on 64-bit kernel works through 'compat' syscalls.
> > > If bpf progs are accessing 64-bit uapi structs in such case they're broken
> > > and no amount of co-re can help.
> >
> > I know nothing about compat, so can't comment on that. But the way I
> > understood the situation was the BPF program compiled once (well, at
> > least from the unmodified source code), but runs on ARM32 and (on a
> > separate physical host) on ARM64. And it's task is, say, to read UAPI
> > kernel structures from syscall arguments.
>
> I'm not sure about arm, but on x86 there should be two different progs
> for this to work (if we're talking about 32-bit userspace).
> See below.
>
> > One simple example I found in UAPI definitions is struct timespec, it
> > seems it's defined with `long`:
> >
> > struct timespec {
> >         __kernel_old_time_t     tv_sec;         /* seconds */
> >         long                    tv_nsec;        /* nanoseconds */
> > }
> >
> > So if you were to trace clock_gettime(), you'd need to deal with
> > differently-sized reads of tv_nsec, depending on whether you are
> > running on the 32-bit or 64-bit host.
>
> I believe gettime is vdso, so that's not a syscall and there are no bpf hooks
> available to track that.
> For normal syscall with timespec argument like sys_recvmmsg and sys_nanosleep
> the user needs to load two programs and attach to two different points:
> fentry/__x64_sys_nanosleep and fentry/__ia32_sys_nanosleep.
> (I'm not an expert on compat and not quite sure how _time32 suffix is handled,
> so may be 4 different progs are needed).
> The point is that there are several different entry points with different args.
> ia32 user space will call into the kernel differently.
> At the end different pieces of the syscall and compat handling will call
> into hrtimer_nanosleep with ktime.
> So if one bpf prog needs to work for 32 and 64 bit user space it should be
> attached to common piece of code like hrtimer_nanosleep().
> If it's attached to syscall entry it needs to attach to at least 2 different
> entry points with 2 different progs.

I think that while it is true that syscall handlers are different for
32 and 64 bit,
it may happen that some syscall handlers will pass user-space pointers down to
shared common functions, which may be hooked by ebpf probes. I am no expert, but
I think that the cp_new_stat function is such an example; it gets a
user struct stat* pointer,
whose definition varies across different architectures (32/64-bit,
different processor archs etc.)

> I guess it's possible to load single prog and kprobe-attach it to
> two different kernel functions, but code is kinda different.
> For the simplest things like sys_nanosleep it might work and timespec
> is simple enough structure to handle with sizeof(long) co-re tricks,
> but it's not a generally applicable approach.
> So for a tool to track 32-bit and 64-bit user space is quite tricky.
>
> If bpf prog doesn't care about user space and only needs to deal
> with 64-bit kernel + 64-bit user space and 32-bit kernel + 32-bit user space
> then it's a bit simpler, but probably still requires attaching
> to two different points. On 32-bit x86 kernel there will be no
> "fentry/__x64_sys_nanosleep" function.
>
> > There are probably other examples where UAPI structs use long instead
> > of __u32 or __u64, but I didn't dig too deep.
>
> There is also crazy difference between compact_ioctl vs ioctl.
>
> The point I'm trying to get across is that the clean 32-bit processing is
> a lot more involved than just CORE_READ_USER macro and I want to make sure that
> the expections are set correctly.
> BPF CO-RE prog may handle 32-bit and 64-bit at the same time, but
> it's probably exception than the rule.
>
> > Let's see if Gilad can provide his perspective. I have no strong
> > feelings about this and can send a patch removing CORE_READ_USER
> > variants (they didn't make it into libbpf v0.3, so no harm or API
> > stability concerns). BPF_PROBE_READ() and BPF_PROBE_READ_USER() are
> > still useful for reading non-relocatable, but nested data structures,
> > so I'd prefer to keep those.
>
> Let's hear from Gilad.
> I'm not against BPF_CORE_READ_USER macro as-is. I was objecting
> to the reasons of implementing it.

If I am not mistaken (which is completely possible), I think that
providing such a macro will
not cause any more confusion than the bpf_probe_read_{,user}
distinction already does,
since BPF_CORE_READ_USER to BPF_CORE_READ is the same as bpf_probe_read_user
to bpf_probe_read.