On Fri, 2023-01-27 at 09:24 -0800, Andrii Nakryiko wrote: > On Fri, Jan 27, 2023 at 8:51 AM Ilya Leoshkevich <iii@xxxxxxxxxxxxx> > wrote: > > > > On Wed, 2023-01-25 at 16:45 -0800, Andrii Nakryiko wrote: > > > On Wed, Jan 25, 2023 at 1:39 PM Ilya Leoshkevich > > > <iii@xxxxxxxxxxxxx> > > > wrote: > > > > > > > > Hi, > > > > > > > > This series implements poke, trampoline, kfunc, mixing subprogs > > > > and > > > > tailcalls, and fixes a number of tests on s390x. > > > > > > > > The following failures still remain: > > > > > > > > #52 core_read_macros:FAIL > > > > Uses BPF_PROBE_READ(), shouldn't there be > > > > BPF_PROBE_READ_KERNEL()? > > > > > > BPF_PROBE_READ(), similarly to BPF_CORE_READ() both use > > > bpf_probe_read_kernel() internally, as it's most common use case. > > > We > > > have separate BPF_PROBE_READ_USER() and BPF_CORE_READ_USER() for > > > when > > > bpf_probe_read_user() has to be used. > > > > At least purely from the code perspective, BPF_PROBE_READ() seems > > to > > delegate to bpf_probe_read(). The following therefore helps with > > this > > test: > > > > --- a/tools/lib/bpf/bpf_core_read.h > > +++ b/tools/lib/bpf/bpf_core_read.h > > @@ -364,7 +364,7 @@ enum bpf_enum_value_kind { > > > > /* Non-CO-RE variant of BPF_CORE_READ_INTO() */ > > #define BPF_PROBE_READ_INTO(dst, src, a, ...) ({ > > \ > > - ___core_read(bpf_probe_read, bpf_probe_read, > > \ > > + ___core_read(bpf_probe_read_kernel, bpf_probe_read_kernel, > > \ > > dst, (src), a, ##__VA_ARGS__) > > \ > > }) > > > > @@ -400,7 +400,7 @@ enum bpf_enum_value_kind { > > > > /* Non-CO-RE variant of BPF_CORE_READ_STR_INTO() */ > > #define BPF_PROBE_READ_STR_INTO(dst, src, a, ...) ({ > > \ > > - ___core_read(bpf_probe_read_str, bpf_probe_read, > > \ > > + ___core_read(bpf_probe_read_kernel_str, > > bpf_probe_read_kernel, > > \ > > dst, (src), a, ##__VA_ARGS__) > > \ > > }) > > > > but I'm not sure if there are backward compatibility concerns, or > > if > > libbpf is supposed to rewrite this when > > !ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. > > Oh, this is just a bug, it was an omission when we converted > BPF_CORE_READ to bpf_probe_read_kernel. If you look at bpf_core_read > definition, it is using bpf_probe_read_kernel, which is also the > intent for BPF_PROBE_READ. Let's fix this. > > And there is no backwards compat concerns because libbpf will > automatically convert calls to bpf_probe_read_{kernel,user}[_str] to > bpf_probe_read[_str] if host kernel doesn't yet support kernel or > user > specific variants. Thanks for confirming! I will include the fix in v2. > > > > #82 get_stack_raw_tp:FAIL > > > > get_stack_print_output:FAIL:user_stack corrupted user stack > > > > Known issue: > > > > We cannot reliably unwind userspace on s390x without DWARF. > > > > > > like in principle, or frame pointers (or some equivalent) needs > > > to be > > > configured for this to work? > > > > > > Asking also in the context of [0], where s390x was excluded. If > > > there > > > is actually a way to enable frame pointer-based stack unwinding > > > on > > > s390x, would be nice to hear from an expert. > > > > > > [0] https://pagure.io/fesco/issue/2923 > > > > For DWARFless unwinding we have -mbackchain (not to be confused > > with > > -fno-omit-frame-pointer, which we also have, but which just hurts > > performance without providing tangible benefits). > > -mbackchain has a few problems though: > > > > - It's not atomic. Here is a typical prologue with -mbackchain: > > > > 1: stmg %r11,%r15,88(%r15) # save non-volatile > > registers > > 2: lgr %r14,%r15 # %r14 = sp > > 3: lay %r15,-160(%r15) # sp -= 160 > > 4: stg %r14,0(%r15) # *(void **)sp = %r14 > > > > The invariant here is that *(void **)%r15 is always a pointer to > > the > > next frame. This means that if we unwind from (4), we are totally > > broken. This does not happen if we unwind from any other > > instruction, > > but still. > > > > - Unwinding from (1)-(3) is not particularly good either. PSW > > points to > > the callee, but R15 points to the caller's frame. > > > > - Unwinding leaf functions is like the previous problem, but worse: > > they often do not establish a stack frame at all, so PSW and R15 > > are > > out of sync for the entire duration of the call. > > > > Therefore .eh_frame-based unwinding is preferred, since it covers > > all > > these corner cases and is therefore reliable. From what I > > understand, > > adding .eh_frame unwinding to the kernel is not desirable. In an > > internal discussion we had another idea though: would it be > > possible to > > have an eBPF program that does .eh_frame parsing and unwinding? I > > understand that it can be technically challenging at the moment, > > but > > the end result would not be exploitable by crafting malicious > > .eh_frame sections, won't loop endlessly, will have good > > performance, > > etc. > > Thanks for details. This was all discussed at length in Fedora > -fno-omit-frame-pointer discussion I linked above, so no real need to > go over this again. .eh_frame-based unwinding on BPF side is > possible, > but only for processes that you knew about (and preprocessed) before > you started profiling session. Pre-processing is memory and > cpu-intensive operation on busy systems, and they will miss any > processes started during profiling. So as a general approach for > system-wide profiling it leaves a lot to be desired. Thanks for the explanation, I'll read the thread and come back if I get some new ideas not listed here. > Should we enable -mbackchain in our CI for s390x to be able to > capture > stack traces (even if on some instructions they might be incomplete > or > outright broken)? Let's do it, I don't have anything against this. [...] > > > > > > Here is the full log: > > > > https://gist.github.com/iii-i/8e20100c33ab6f0dffb5e6e51d1330e8 > > > > Apparently we do indeed lose a constraint established by > > if (hdr->tcp_len < sizeof(*hdr->tcp)). But the naive > > The test is too big and unfamiliar for me to figure this out. And the > problem is not upper bound, but lower bound. hdr->tcp_len is not > proven to be strictly greater than zero, which is what verifier > complains about. Not sure how it works on other arches right now. > > > But I see that bpf_csum_diff defines size arguments as > ARG_CONST_SIZE_OR_ZERO while bpf_tcp_raw_gen_syncookie_ipv4 has > ARG_CONST_SIZE. I generally found ARG_CONST_SIZE way too problematic > in practice, I'd say we should change it to ARG_CONST_SIZE_OR_ZERO. Yes, this helps, and doesn't seem to introduce issues, since the minimum length is enforced inside this function anyway. I will include the change for bpf_tcp_raw_gen_syncookie_ipv{4,6} in v2; I guess some other functions may benefit from this as well, but this is outside the scope of this series. [...]