On Mon, Mar 25, 2019 at 1:14 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > On Mon, Mar 25, 2019 at 8:44 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote: > > On Wed, Mar 20, 2019 at 12:40 PM Daniel Colascione <dancol@xxxxxxxxxx> wrote: > > > On Wed, Mar 20, 2019 at 12:14 PM Christian Brauner <christian@xxxxxxxxxx> wrote: > > > > On Wed, Mar 20, 2019 at 11:58:57AM -0700, Andy Lutomirski wrote: > > > > > On Wed, Mar 20, 2019 at 11:52 AM Christian Brauner <christian@xxxxxxxxxx> wrote: > > > > > > > > > > > > You're misunderstanding. Again, I said in my previous mails it should > > > > > > accept pidfds optionally as arguments, yes. But I don't want it to > > > > > > return the status fds that you previously wanted pidfd_wait() to return. > > > > > > I really want to see Joel's pidfd_wait() patchset and have more people > > > > > > review the actual code. > > > > > > > > > > Just to make sure that no one is forgetting a material security consideration: > > > > > > > > Andy, thanks for commenting! > > > > > > > > > > > > > > $ ls /proc/self > > > > > attr exe mountinfo projid_map status > > > > > autogroup fd mounts root syscall > > > > > auxv fdinfo mountstats sched task > > > > > cgroup gid_map net schedstat timers > > > > > clear_refs io ns sessionid timerslack_ns > > > > > cmdline latency numa_maps setgroups uid_map > > > > > comm limits oom_adj smaps wchan > > > > > coredump_filter loginuid oom_score smaps_rollup > > > > > cpuset map_files oom_score_adj stack > > > > > cwd maps pagemap stat > > > > > environ mem personality statm > > > > > > > > > > A bunch of this stuff makes sense to make accessible through a syscall > > > > > interface that we expect to be used even in sandboxes. But a bunch of > > > > > it does not. For example, *_map, mounts, mountstats, and net are all > > > > > namespace-wide things that certain policies expect to be unavailable. > > > > > stack, for example, is a potential attack surface. Etc. > > > > > > If you can access these files sources via open(2) on /proc/<pid>, you > > > should be able to access them via a pidfd. If you can't, you > > > shouldn't. Which /proc? The one you'd get by mounting procfs. I don't > > > see how pidfd makes any material changes to anyone's security. As far > > > as I'm concerned, if a sandbox can't mount /proc at all, it's just a > > > broken and unsupported configuration. > > > > It's not "broken and unsupported". I know of an actual working, > > deployed container-ish sandbox that does exactly this. I would also > > guess that quite a few not-at-all-container-like sandboxes work like > > this. (The obvious seccomp + unshare + pivot_root > > deny-myself-access-to-lots-of-things trick results in no /proc, which > > is by dsign.) > > > > > > > > An actual threat model and real thought paid to access capabilities > > > would help. Almost everything around the interaction of Linux kernel > > > namespaces and security feels like a jumble of ad-hoc patches added as > > > afterthoughts in response to random objections. > > > > I fully agree. But if you start thinking for real about access > > capabilities, there's no way that you're going to conclude that a > > capability to access some process implies a capability to access the > > settings of its network namespace. > > > > > > > > >> All these new APIs either need to > > > > > return something more restrictive than a proc dirfd or they need to > > > > > follow the same rules. > > > > > > > ... > > > > > What's special about libraries? How is a library any worse-off using > > > openat(2) on a pidfd than it would be just opening the file called > > > "/proc/$apid"? > > > > Because most libraries actually work, right now, without /proc. Even > > libraries that spawn subprocesses. If we make the new API have the > > property that it doesn't work if you're in a non-root user namespace > > and /proc isn't mounted, the result will be an utter mess. > > > > > > > > > > Yes, this is unfortunate, but it is indeed the current situation. I > > > > > suppose that we could return magic restricted dirfds, or we could > > > > > return things that aren't dirfds and all and have some API that gives > > > > > you the dirfd associated with a procfd but only if you can see > > > > > /proc/PID. > > > > > > > > What would be your opinion to having a > > > > /proc/<pid>/handle > > > > file instead of having a dirfd. Essentially, what I initially proposed > > > > at LPC. The change on what we currently have in master would be: > > > > https://gist.github.com/brauner/59eec91550c5624c9999eaebd95a70df > > > > > > And how do you propose, given one of these handle objects, getting a > > > process's current priority, or its current oom score, or its list of > > > memory maps? As I mentioned in my original email, and which nobody has > > > addressed, if you don't use a dirfd as your process handle or you > > > don't provide an easy way to get one of these proc directory FDs, you > > > need to duplicate a lot of metadata access interfaces. > > > > An API that takes a process handle object and an fd pointing at /proc > > (the root of the proc fs) and gives you back a proc dirfd would do the > > trick. You could do this with no new kernel features at all if you're > > willing to read the pid, call openat(2), and handle the races in user > > code. > > This seems like something that might be a good fit for two ioctls? As an aside, we had a long discussion about why fundamental facilities like this should be system calls, not ioctls. I think the arguments still apply. > One ioctl on procfs roots to translate pidfds into that procfs, > subject to both the normal lookup permission checks and only working > if the pidfd has a translation into the procfs: > > int proc_root_fd = open("/proc", O_RDONLY); > int proc_dir_fd = ioctl(proc_root_fd, PROC_PIDFD_TO_PROCFSFD, pidfd); > > And one ioctl on procfs directories to translate from PGIDs and PIDs to pidfds: > > int proc_pgid_fd = open("/proc/self", O_RDONLY); > int self_pg_pidfd = ioctl(proc_pgid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > int proc_pid_fd = open("/proc/thread-self", O_RDONLY); > int self_p_pidfd = ioctl(proc_pid_fd, PROC_PROCFSFD_TO_PIDFD, 0); > > > And then, as you proposed, the new sys_clone() can just return a > pidfd, and you can convert it into a procfs fd yourself if you want. I think that's the consensus we reached on the other thread. The O_DIRECTORY open on /proc/self/fd/mypidfd seems like it'd work well enough.