On Tue, Mar 19, 2019 at 03:48:32PM -0700, Daniel Colascione wrote: > On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christian@xxxxxxxxxx> wrote: > > So I dislike the idea of allocating new inodes from the procfs super > > block. I would like to avoid pinning the whole pidfd concept exclusively > > to proc. The idea is that the pidfd API will be useable through procfs > > via open("/proc/<pid>") because that is what users expect and really > > wanted to have for a long time. So it makes sense to have this working. > > But it should really be useable without it. That's why translate_pid() > > and pidfd_clone() are on the table. What I'm saying is, once the pidfd > > api is "complete" you should be able to set CONFIG_PROCFS=N - even > > though that's crazy - and still be able to use pidfds. This is also a > > point akpm asked about when I did the pidfd_send_signal work. > > I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One > crazy idea that I was discussing with Joel the other day is to just > make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root() > system call that returned, out of thin air and independent of the > mount table, a procfs root directory file descriptor for the caller's > PID namspace and suitable for use with openat(2). Even if this works I'm pretty sure that Al and a lot of others will not be happy about this. A syscall to get an fd to /proc? That's not going to happen and I don't see the need for a separate syscall just for that. (I do see the point of making CONFIG_PROCFS=y the default btw.) Inode allocation from the procfs mount for the file descriptors Joel wants is not correct. Their not really procfs file descriptors so this is a nack. We can't just hook into proc that way. > > C'mon: /proc is used by everyone today and almost every program breaks > if it's not around. The string "/proc" is already de facto kernel ABI. > Let's just drop the pretense of /proc being optional and bake it into > the kernel proper, then give programs a way to get to /proc that isn't > tied to any particular mount configuration. This way, we don't need a > translate_pid(), since callers can just use procfs to do the same > thing. (That is, if I understand correctly what translate_pid does.) I'm not sure what you think translate_pid() is doing since you're not saying what you think it does. Examples from the old patchset: translate_pid(pid, ns, -1) - get pid in our pid namespace translate_pid(pid, -1, ns) - get pid in other pid namespace translate_pid(1, ns, -1) - get pid of init task for namespace translate_pid(pid, -1, ns) > 0 - is pid is reachable from ns? translate_pid(1, ns1, ns2) > 0 - is ns1 inside ns2? translate_pid(1, ns1, ns2) == 0 - is ns1 outside ns2? translate_pid(1, ns1, ns2) == 1 - is ns1 equal ns2? Allowing this syscall to yield pidfds as proper regular file fds and also taking pidfds as argument is an excellent way to kill a few problems at once: - cheap pid namespace introspection - creates a bridge between the "old" pid-based api and the "new" pidfd api - allows us to get proper non-directory file descriptors for any pids we like The additional advantage is that people are already happy to add this syscall so simply extending it and routing it through the pidfd tree or Eric's tree is reasonable. (It should probably grow a flag argument. I need to start prototyping this.) > > We still need a pidfd_clone() for atomicity reasons, but that's a > separate story. My goal is to be able to write a library that Yes, on my todo list and I have a ported patch based on prior working rotting somehwere on my git server. > transparently creates and manages a helper child process even in a > "hostile" process environment in which some other uncoordinated thread > is constantly doing a waitpid(-1) (e.g., the JVM). > > > So instead of going throught proc we should probably do what David has > > been doing in the mount API and come to rely on anone_inode. So > > something like: > > > > fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags); > > > > and stash information such as pid namespace etc. in a pidfd struct or > > something that we then can stash file->private_data of the new file. > > This also lets us avoid all this open coding done here. > > Another advantage is that anon_inodes is its own kernel-internal > > filesystem. > > Sure. That works too. Great.