2023-03-04 15:39 UTC-0800 ~ Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> > On Wed, Mar 1, 2023 at 1:09 PM Dmitry Dolgov <9erthalion6@xxxxxxxxx> wrote: >> >>> On Wed, Mar 01, 2023 at 11:02:25AM -0800, Andrii Nakryiko wrote: >>> >>>> Use libbpf_strerror_r to expand the error when failed to parse the btf >>>> file at btf_custom_path. It does not change a lot locally, but since the >>>> error will bubble up through a few layers, it may become quite >>>> confusing otherwise. As an example here is what happens when the file >>>> indicated via btf_custom_path does not exist and the caller uses >>>> strerror as well: >>>> >>>> libbpf: failed to parse target BTF: -2 >>>> libbpf: failed to perform CO-RE relocations: -2 >>>> libbpf: failed to load object 'bpf_probe' >>>> libbpf: failed to load BPF skeleton 'bpf_probe': -2 >>>> [caller]: failed to load BPF object (errno: 2 | message: No such file or directory) >>>> >>>> In this context "No such file or directory" could be easily >>>> misinterpreted as belonging to some other part of loading process, e.g. >>>> the BPF object itself. With this change it would look a bit better: >>>> >>>> libbpf: failed to parse target BTF: No such file or directory >>>> libbpf: failed to perform CO-RE relocations: -2 >>>> libbpf: failed to load object 'bpf_probe' >>>> libbpf: failed to load BPF skeleton 'bpf_probe': -2 >>>> [caller]: failed to load BPF object (errno: 2 | message: No such file or directory) >>> >>> I find these text-only error messages more harmful, actually. Very >>> often their literal meaning is confusing, and instead the process is >>> to guess what's -Exxx error they represent, and go from there. >>> >>> Recently me and Quentin discussed moving towards an approach where >>> we'd log both symbolic error value (-EPERM instead of -1) and also >>> human-readable text message. So I'd prefer us figuring out how to do >>> this ergonomically in libbpf and bpftool code base, and start moving >>> in that direction. >> >> Fair enough, thanks. I would love to try out any suggestions in this >> area -- we were recently looking into error handling, and certain parts >> were suboptimal. >> >> Talking about confusing text error messages, I'm curious about -ESRCH >> usage. It's being used in libbpf and various subsystem as well to >> indicate that something wasn't found, so I guess it's an established >> practice. But then in case btf__load_vmlinux_btf can't find a proper >> file and reports an error, the caller gets surprising "No such process" >> out of strerror. Am I missing something, is it implemented like this on >> purpose? > > It's probably not 100% consistent throughout libbpf, but -ESRCH is > used to denote "a process to determine/find something failed". -ENOENT > is used when we are requested to find a specific entry, and it's not > there (but otherwise there were no errors encountered). That's the > distinction. > > The problem with those text explanations of errors is that they are > coming from Linux's usage of them in the context of process or file > manipulations, and I don't see a way around that. I'd like to minimize > the use of custom error codes. > > But this is the reason I'd like to output `-ESRCH` instead of either > -3 or "No such process". Something like "-ESRCH (No such process)" is > a compromise, but better than nothing. > > Or we could stick to just -ESRCH. That might be better than test > descriptions, as we at least don't confuse them with irrelevant > descriptions. > > But Quentin might find it not very user-friendly for his bpftool use > cases, probably. Yes, even though the messages are sometimes confusing I find that there are also occasions where they're actually useful to users not familiar with the error names (not easy to figure out what "ESRCH" means if you've never seen it before), so I'd avoid removing them entirely from bpftool. Just as Andrii writes, we talked [0] on displaying both the error name and the description through libbpf_strerror_r(): Error: can't get next program: [-EPERM] Operation not permitted So that users with more knowledge can skip the description and just look at the error name. I haven't started to work on this, though. [0] https://lore.kernel.org/all/CAEf4BzZMJGrRhNeQeWB0fRsuRYUv01aZGhvDeFV2o5zdpRbR-w@xxxxxxxxxxxxxx/