Re: Extending bpf_get_ns_current_pid_tgid()

Yonghong Song <yhs@xxxxxx> · Thu, 12 Nov 2020 20:13:21 -0800

On 11/12/20 4:57 PM, Daniel Xu wrote:
On Thu Nov 12, 2020 at 4:27 PM PST, Yonghong Song wrote:

On 11/12/20 2:20 PM, Daniel Xu wrote:
Hi,

I'm looking at the current implementation of
bpf_get_ns_current_pid_tgid() and the helper seems to be a bit overly
restricting to me. Specifically the following line:

      if (!ns_match(&pidns->ns, (dev_t)dev, ino))
              goto clear;

Why bail if the inode # does not match? IIUC from the old discussions,
it was b/c in the future pidns files might belong to different devices.
It's not clear to me (possibly b/c I'm missing something) why the inode
has to match as well.

Yes, pidns file might belong to different devices in theory so we need
to match dev as well.

The inode number needs to match so we can ensure user indeed wants to
get the *current pidns* tgid/pid.

Right, this double-checking at the API level is what feels strange to
me -- why make the user prove they know what they're doing?

If we do not have this checking, it is possible that in interrupt
context, pidns #10 user may get a tgid/pid actually from pisns #11,
and tgid/pid could be valid for pidns #10. This result will be
actually wrong.

Furthermore, the "proof" restricts flexibility. It's as if
bpf_get_current_task() required a (dev,ino) pair. How would you get the
namespaced pid for a process you don't know about yet? eg when you're
profiling the system.

Did not fully understand questions here. Do you mean
  bpf_get_current_task(dev, ino)
that will be weird. task is not associated with dev/ino.

(dev, ino) input expressed user intention. Without this, in no-process
context, it will be hard to interpret the results.

But bpf_get_current_pid_tgid() doesn't return errors so this shouldn't
either, right?

Different helpers can have different signatures.

Would it be possible to instead have the helper return the pid/tgid of
the current task as viewed _from_ the `dev`/`ino` pidns? If the current
task is hidden from the `dev`/`ino` pidns, then return -ENOENT. The use
case is for bpftrace symbolize stacks when run inside a container. For
example:

      (in-container)# bpftrace -e 'profile:hz:99 { print(ustack) }'

I think you try to propose something like below:
- user provides dev/ino
- the helper will try to go through all pidns'es (not just active
one), if any match pidns match, returns tgid/pid in that pidns,
otherwise, returns -ENOENT.

Right, exactly.

If you want to do this, you will need a new helper like
  bpf_get_ns_pid_tgid

It actually will be weird to use this helper as it looks like
you try to get pid/tgid of another ns. So we do need to nail
down the use case here.

The current helper is
bpf_get_ns_current_pid_tgid
you want
bpf_get_ns_pid_tgid

I think it is possible, you need to check
pid->numbers[pid_level].ns
for all pid levels. You need to get a reference count for the namespace
to ensure valid result.

This may work for root inode, but for container inode, it may have
issues. For example,
container 1: create, inode 2
container 1 removed
container 2: create, inode 2
If you use inode 2, depending on timing you may accidentally targetting
wrong container.

Yeah, so maybe an fd to /proc/<pid>/ns/pid or something.

I think you can workaround the issue without this helper. See below.

This currently does not work b/c bpftrace will generate a prog that gets
the root pidns pid, pack it with the stackid, and pass it up to
userspace. But b/c bpftrace is running inside the container, the root
pidns pid is invalid and symbolization fails.

bpftrace can generate a program takes dev/inode as input parameters in
map. The bpftrace will supply dev/inode value, by query the current
system/container, and then run the program.

I don't think it's very feasible to have bpftrace integrate with every
container runtime out there. This also becomes really difficult to
manage if you want to trace N processes. You'd need N maps or N progs.

Why, just one map to store dev/inode is shared among all progs, right?

What would be nice is if bpftrace could generate a prog that gets the
current pid as viewed from bpftrace's pidns. Then symbolization would
work.

Despite the above workaround, what you really need is although it is
running on container, you want to get stack trace interpreted with
root pid/tgid for symbolization purpose? But you can already achieve
this with bpf_get_pid_tgid()?

No, this isn't possible when bpftrace runs inside the container. ie
bpftrace is in a pidns along with the tracees. Bpftrace gets the root
pidns pid from the kernel but cannot resolve it to the pidns pid. That
means bpftrace cannot find the executable file to symbolize against.

Not sure whether I understand correct or not. You want root pid to
find exec, right? but bpf_get_pid_tgid() will give your root pid?
Maybe I miss something here...

[...]

Thanks,
Daniel