On Wed, Jan 6, 2021 at 8:09 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Tue, Jan 05, 2021 at 01:03:47PM -0800, Andrii Nakryiko wrote: > > > > > > > > - handling 32-bit vs 64-bit UAPI structs uniformly; > > > > > > what do you mean? > > > 32-bit user space running on 64-bit kernel works through 'compat' syscalls. > > > If bpf progs are accessing 64-bit uapi structs in such case they're broken > > > and no amount of co-re can help. > > > > I know nothing about compat, so can't comment on that. But the way I > > understood the situation was the BPF program compiled once (well, at > > least from the unmodified source code), but runs on ARM32 and (on a > > separate physical host) on ARM64. And it's task is, say, to read UAPI > > kernel structures from syscall arguments. > > I'm not sure about arm, but on x86 there should be two different progs > for this to work (if we're talking about 32-bit userspace). > See below. > > > One simple example I found in UAPI definitions is struct timespec, it > > seems it's defined with `long`: > > > > struct timespec { > > __kernel_old_time_t tv_sec; /* seconds */ > > long tv_nsec; /* nanoseconds */ > > } > > > > So if you were to trace clock_gettime(), you'd need to deal with > > differently-sized reads of tv_nsec, depending on whether you are > > running on the 32-bit or 64-bit host. > > I believe gettime is vdso, so that's not a syscall and there are no bpf hooks > available to track that. > For normal syscall with timespec argument like sys_recvmmsg and sys_nanosleep > the user needs to load two programs and attach to two different points: > fentry/__x64_sys_nanosleep and fentry/__ia32_sys_nanosleep. > (I'm not an expert on compat and not quite sure how _time32 suffix is handled, > so may be 4 different progs are needed). > The point is that there are several different entry points with different args. > ia32 user space will call into the kernel differently. > At the end different pieces of the syscall and compat handling will call > into hrtimer_nanosleep with ktime. > So if one bpf prog needs to work for 32 and 64 bit user space it should be > attached to common piece of code like hrtimer_nanosleep(). > If it's attached to syscall entry it needs to attach to at least 2 different > entry points with 2 different progs. I think that while it is true that syscall handlers are different for 32 and 64 bit, it may happen that some syscall handlers will pass user-space pointers down to shared common functions, which may be hooked by ebpf probes. I am no expert, but I think that the cp_new_stat function is such an example; it gets a user struct stat* pointer, whose definition varies across different architectures (32/64-bit, different processor archs etc.) > I guess it's possible to load single prog and kprobe-attach it to > two different kernel functions, but code is kinda different. > For the simplest things like sys_nanosleep it might work and timespec > is simple enough structure to handle with sizeof(long) co-re tricks, > but it's not a generally applicable approach. > So for a tool to track 32-bit and 64-bit user space is quite tricky. > > If bpf prog doesn't care about user space and only needs to deal > with 64-bit kernel + 64-bit user space and 32-bit kernel + 32-bit user space > then it's a bit simpler, but probably still requires attaching > to two different points. On 32-bit x86 kernel there will be no > "fentry/__x64_sys_nanosleep" function. > > > There are probably other examples where UAPI structs use long instead > > of __u32 or __u64, but I didn't dig too deep. > > There is also crazy difference between compact_ioctl vs ioctl. > > The point I'm trying to get across is that the clean 32-bit processing is > a lot more involved than just CORE_READ_USER macro and I want to make sure that > the expections are set correctly. > BPF CO-RE prog may handle 32-bit and 64-bit at the same time, but > it's probably exception than the rule. > > > Let's see if Gilad can provide his perspective. I have no strong > > feelings about this and can send a patch removing CORE_READ_USER > > variants (they didn't make it into libbpf v0.3, so no harm or API > > stability concerns). BPF_PROBE_READ() and BPF_PROBE_READ_USER() are > > still useful for reading non-relocatable, but nested data structures, > > so I'd prefer to keep those. > > Let's hear from Gilad. > I'm not against BPF_CORE_READ_USER macro as-is. I was objecting > to the reasons of implementing it. If I am not mistaken (which is completely possible), I think that providing such a macro will not cause any more confusion than the bpf_probe_read_{,user} distinction already does, since BPF_CORE_READ_USER to BPF_CORE_READ is the same as bpf_probe_read_user to bpf_probe_read.