On Wed, Jan 6, 2021 at 2:10 AM Gilad Reti <gilad.reti@xxxxxxxxx> wrote: > > On Wed, Jan 6, 2021 at 8:09 AM Alexei Starovoitov > <alexei.starovoitov@xxxxxxxxx> wrote: > > > > On Tue, Jan 05, 2021 at 01:03:47PM -0800, Andrii Nakryiko wrote: > > > > > > > > > > - handling 32-bit vs 64-bit UAPI structs uniformly; > > > > > > > > what do you mean? > > > > 32-bit user space running on 64-bit kernel works through 'compat' syscalls. > > > > If bpf progs are accessing 64-bit uapi structs in such case they're broken > > > > and no amount of co-re can help. > > > > > > I know nothing about compat, so can't comment on that. But the way I > > > understood the situation was the BPF program compiled once (well, at > > > least from the unmodified source code), but runs on ARM32 and (on a > > > separate physical host) on ARM64. And it's task is, say, to read UAPI > > > kernel structures from syscall arguments. > > > > I'm not sure about arm, but on x86 there should be two different progs > > for this to work (if we're talking about 32-bit userspace). > > See below. > > > > > One simple example I found in UAPI definitions is struct timespec, it > > > seems it's defined with `long`: > > > > > > struct timespec { > > > __kernel_old_time_t tv_sec; /* seconds */ > > > long tv_nsec; /* nanoseconds */ > > > } > > > > > > So if you were to trace clock_gettime(), you'd need to deal with > > > differently-sized reads of tv_nsec, depending on whether you are > > > running on the 32-bit or 64-bit host. > > > > I believe gettime is vdso, so that's not a syscall and there are no bpf hooks > > available to track that. > > For normal syscall with timespec argument like sys_recvmmsg and sys_nanosleep > > the user needs to load two programs and attach to two different points: > > fentry/__x64_sys_nanosleep and fentry/__ia32_sys_nanosleep. > > (I'm not an expert on compat and not quite sure how _time32 suffix is handled, > > so may be 4 different progs are needed). > > The point is that there are several different entry points with different args. > > ia32 user space will call into the kernel differently. > > At the end different pieces of the syscall and compat handling will call > > into hrtimer_nanosleep with ktime. > > So if one bpf prog needs to work for 32 and 64 bit user space it should be > > attached to common piece of code like hrtimer_nanosleep(). > > If it's attached to syscall entry it needs to attach to at least 2 different > > entry points with 2 different progs. > > I think that while it is true that syscall handlers are different for > 32 and 64 bit, > it may happen that some syscall handlers will pass user-space pointers down to > shared common functions, which may be hooked by ebpf probes. I am no expert, but > I think that the cp_new_stat function is such an example; it gets a > user struct stat* pointer, > whose definition varies across different architectures (32/64-bit, > different processor archs etc.) > > > I guess it's possible to load single prog and kprobe-attach it to > > two different kernel functions, but code is kinda different. > > For the simplest things like sys_nanosleep it might work and timespec > > is simple enough structure to handle with sizeof(long) co-re tricks, > > but it's not a generally applicable approach. > > So for a tool to track 32-bit and 64-bit user space is quite tricky. > > > > If bpf prog doesn't care about user space and only needs to deal > > with 64-bit kernel + 64-bit user space and 32-bit kernel + 32-bit user space > > then it's a bit simpler, but probably still requires attaching > > to two different points. On 32-bit x86 kernel there will be no > > "fentry/__x64_sys_nanosleep" function. > > > > > There are probably other examples where UAPI structs use long instead > > > of __u32 or __u64, but I didn't dig too deep. > > > > There is also crazy difference between compact_ioctl vs ioctl. > > > > The point I'm trying to get across is that the clean 32-bit processing is > > a lot more involved than just CORE_READ_USER macro and I want to make sure that > > the expections are set correctly. > > BPF CO-RE prog may handle 32-bit and 64-bit at the same time, but > > it's probably exception than the rule. > > > > > Let's see if Gilad can provide his perspective. I have no strong > > > feelings about this and can send a patch removing CORE_READ_USER > > > variants (they didn't make it into libbpf v0.3, so no harm or API > > > stability concerns). BPF_PROBE_READ() and BPF_PROBE_READ_USER() are > > > still useful for reading non-relocatable, but nested data structures, > > > so I'd prefer to keep those. > > > > Let's hear from Gilad. > > I'm not against BPF_CORE_READ_USER macro as-is. I was objecting > > to the reasons of implementing it. > > If I am not mistaken (which is completely possible), I think that > providing such a macro will > not cause any more confusion than the bpf_probe_read_{,user} > distinction already does, > since BPF_CORE_READ_USER to BPF_CORE_READ is the same as bpf_probe_read_user > to bpf_probe_read. I think the biggest source of confusion is that USER part in BPF_CORE_READ_USER refers to reading data from user address space, not really user structs (which is kind of natural instinct here). CO-RE *always* works only with kernel types, which is obvious if you have a lot of experience with using CO-RE, but not initially, unfortunately. So in the end, while a narrow use case, it seems like it might be useful to have BPF_CORE_READ_USER. Maybe emphasizing (in the doc comment) the fact that all types need to be kernel types would help a bit? If not, it's not a big deal, because it's pretty easy for users to just explicitly write bpf_probe_read_user(&dst, sizeof(dst), __builtin_preserve_access_index(src)). But there is definitely a bit of mental satisfaction with having all possible combinations of CORE/non-CORE and USER/KERNEL variants :) Btw, it seems these patches are already in bpf-next, so, Alexei, please let me know if you insist on removing BPF_CORE_READ_USER() variant and I'll send a follow up patch.