> On 11/12/20 4:57 PM, Daniel Xu wrote: >> On Thu Nov 12, 2020 at 4:27 PM PST, Yonghong Song wrote: >>> >>> >>> On 11/12/20 2:20 PM, Daniel Xu wrote: >>>> Hi, >>>> >>>> I'm looking at the current implementation of >>>> bpf_get_ns_current_pid_tgid() and the helper seems to be a bit overly >>>> restricting to me. Specifically the following line: >>>> >>>> if (!ns_match(&pidns->ns, (dev_t)dev, ino)) >>>> goto clear; >>>> >>>> Why bail if the inode # does not match? IIUC from the old discussions, >>>> it was b/c in the future pidns files might belong to different devices. >>>> It's not clear to me (possibly b/c I'm missing something) why the inode >>>> has to match as well. >>> >>> Yes, pidns file might belong to different devices in theory so we need >>> to match dev as well. >>> >>> The inode number needs to match so we can ensure user indeed wants to >>> get the *current pidns* tgid/pid. >> >> Right, this double-checking at the API level is what feels strange to >> me -- why make the user prove they know what they're doing? > > If we do not have this checking, it is possible that in interrupt > context, pidns #10 user may get a tgid/pid actually from pisns #11, > and tgid/pid could be valid for pidns #10. This result will be > actually wrong. > >> >> Furthermore, the "proof" restricts flexibility. It's as if >> bpf_get_current_task() required a (dev,ino) pair. How would you get the >> namespaced pid for a process you don't know about yet? eg when you're >> profiling the system. > > Did not fully understand questions here. Do you mean > bpf_get_current_task(dev, ino) > that will be weird. task is not associated with dev/ino. > >> >>> >>> (dev, ino) input expressed user intention. Without this, in no-process >>> context, it will be hard to interpret the results. >> >> But bpf_get_current_pid_tgid() doesn't return errors so this shouldn't >> either, right? > > Different helpers can have different signatures. > >> >>> >>>> >>>> Would it be possible to instead have the helper return the pid/tgid of >>>> the current task as viewed _from_ the `dev`/`ino` pidns? If the current >>>> task is hidden from the `dev`/`ino` pidns, then return -ENOENT. The use >>>> case is for bpftrace symbolize stacks when run inside a container. For >>>> example: >>>> >>>> (in-container)# bpftrace -e 'profile:hz:99 { print(ustack) }' >>> >>> I think you try to propose something like below: >>> - user provides dev/ino >>> - the helper will try to go through all pidns'es (not just active >>> one), if any match pidns match, returns tgid/pid in that pidns, >>> otherwise, returns -ENOENT. >> >> Right, exactly. > > If you want to do this, you will need a new helper like > bpf_get_ns_pid_tgid > > It actually will be weird to use this helper as it looks like > you try to get pid/tgid of another ns. So we do need to nail > down the use case here. I'll try and describe the use case I have in mind. I expect folks would like to use bpftrace in container X to trace events in container Y, where X may or not be Y, provided the bpftrace process has the required access to Y's namespace. For symbolization to work, bpftrace needs to get the pid of processes in container Y but in the namespace of X. For example X could be a parent cgroup to multiple workloads including X, and the owner of these workloads has access to X but not the host itself. I don't see it as trying to get pid/tgid from another namespace, it's actually to get the pid in the namespace where it can be acted upon (i.e. X). >>> The current helper is >>> bpf_get_ns_current_pid_tgid >>> you want >>> bpf_get_ns_pid_tgid >>> >>> I think it is possible, you need to check >>> pid->numbers[pid_level].ns >>> for all pid levels. You need to get a reference count for the namespace >>> to ensure valid result. >>> >>> This may work for root inode, but for container inode, it may have >>> issues. For example, >>> container 1: create, inode 2 >>> container 1 removed >>> container 2: create, inode 2 >>> If you use inode 2, depending on timing you may accidentally targetting >>> wrong container. >> >> Yeah, so maybe an fd to /proc/<pid>/ns/pid or something. >> >>> >>> I think you can workaround the issue without this helper. See below. >>> >>>> >>>> This currently does not work b/c bpftrace will generate a prog that gets >>>> the root pidns pid, pack it with the stackid, and pass it up to >>>> userspace. But b/c bpftrace is running inside the container, the root >>>> pidns pid is invalid and symbolization fails. >>> >>> bpftrace can generate a program takes dev/inode as input parameters in >>> map. The bpftrace will supply dev/inode value, by query the current >>> system/container, and then run the program. >> >> I don't think it's very feasible to have bpftrace integrate with every >> container runtime out there. This also becomes really difficult to >> manage if you want to trace N processes. You'd need N maps or N progs. > > Why, just one map to store dev/inode is shared among all progs, right? > >> >>> >>>> >>>> What would be nice is if bpftrace could generate a prog that gets the >>>> current pid as viewed from bpftrace's pidns. Then symbolization would >>>> work. >>> >>> Despite the above workaround, what you really need is although it is >>> running on container, you want to get stack trace interpreted with >>> root pid/tgid for symbolization purpose? But you can already achieve >>> this with bpf_get_pid_tgid()? >> >> No, this isn't possible when bpftrace runs inside the container. ie >> bpftrace is in a pidns along with the tracees. Bpftrace gets the root >> pidns pid from the kernel but cannot resolve it to the pidns pid. That >> means bpftrace cannot find the executable file to symbolize against. > > Not sure whether I understand correct or not. You want root pid to > find exec, right? but bpf_get_pid_tgid() will give your root pid? > Maybe I miss something here... Looks like your suggestion is for bpftrace to use bpf_get_pid_tgid() when it's running in the root namespace (i.e. the host), and bpf_get_ns_current_pid_tgid() when it's running in another namespace (i.e. a container). I think this would be fine in the short term, even though it doesn't cover all the cases (see above). That said, I'd say the intelligence belongs more in a bpf helper than in user space. Thanks, Blaise