On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote: > > It tries to make the kernel respect the access modes for fds. Without > > this patch, there seem to be some holes: nothing looked at program fds > > and, unless I missed something, you could take a readonly fd for a > > program, pin the program, and reopen it RW. > > I think it's by design. iirc Daniel had a use case for something like this. That seems odd. Daniel, can you elaborate? > Hence unprivileged bpf is actually something that can be deprecated. I hope not. There are a couple setsockopt uses right now, and and seccomp will surely want it someday. And the bpf-inside-container use case really is unprivileged bpf -- containers are, in many (most?) cases, explicitly not trusted by the host. But regardless, see below. > > The ability to obtain an fd to a cgroup does *not* imply any right to > > modify that cgroup. The ability to write to a cgroup directory > > already means something else -- it's the ability to create cgroups > > under the group in question. I'm suggesting that a new API be added > > that allows attaching a bpf program to a cgroup without capabilities > > and that instead requires write access to a new file in the cgroup > > directory. (It could be a single file for all bpf types or one file > > per type. I prefer the latter -- it gives the admin finer-grained > > control.) > > This is something to discuss. I don't mind something like this, > but in general bpf is not for untrusted users. > Hence I don't want to overdesign. I think the code would end up being extremely simple. It would be an ABI change, but I think that it could easily be arranged so that new and old libbpf would be fully functional on new and old kernels except that only new libbpf with a new kernel would be able to attach cgroup programs without privilege. I see no reason to remove the current cgroup attach API. > See https://github.com/systemd/systemd/blob/01234e1fe777602265ffd25ab6e73823c62b0efe/src/core/bpf-firewall.c#L671-L674 > bpf based IP sandboxing doesn't work in 'systemd --user'. > That is just one of the problems that people complained about. This isn't privilege *dropping* -- this is not having privilege in the first place. Giving systemd --user unrestricted use of bpf() is tantamount to making the user root. But again, see below. > > Inside containers and inside nested containers we need to start processes > that will use bpf. All of the processes are trusted. Trusted by whom? In a non-nested container, the container manager *might* be trusted by the outside world. In a *nested* container, unless the inner container management is controlled from outside the outer container, it's not trusted. I don't know much about how Facebook's containers work, but the LXC/LXD/Podman world is moving very strongly toward user namespaces and maximally-untrusted containers, and I think bpf() should work in that context. > To solve your concern of bypassing all capable checks... > How about we do /dev/bpf/full_verifier first? > It will replace capable() checks in the verifier only. I'm not convinced that "in the verifier" is the right distinction. Telling administrators that some setting lets certain users bypass bpf() verifier checks doesn't have a clear enough meaning. I propose, instead, that the current capable() checks be divided into three categories: a) Those that, by design, control privileged operations. This includes most attach calls, but it also includes allow_ptr_leaks, bpf_probe_read(), and quite a few other things. It also includes all of the by_id calls, I think, unless some clever modification to the way they worked would isolate different users' objects. I think that persistent objects can do pretty much everything that by_id users would need, so this isn't a big deal. b) Those that protect code that is considered to be at higher risk of exposing security bugs. In other words, these are things that would have no need for privilege if we could prove that the implementation was perfect. This includes bpf2bpf, LPM, etc. c) Those that serve to prevent excessive resource usage. This includes the bounded loop code (I think -- this might also be category (b)) and several explicit checks for program size. I suppose that the speculative execution mitigation stuff could be split out from (b) into its own category. Based on this, how about a different division: /dev/bpf_dangerous for (b) and /dev/bpf_resources for (c)? And (a) is dealt with by gradually reducing the privilege needed for the important operations. /dev/bpf_dangerous isn't ideal in the long run though, since a lot of the operations in that category will probably become less dangerous as the code is better vetted, and people granting "dangerous" permission might want to restrict it to only the specific parts that they need. /dev/bpf_lpm, /dev/bpf2bpf, etc could work better. This type of thing actually fits quite nicely into an idea I've been thinking about for a while called "implicit rights". In very brief summary, there would be objects called /dev/rights/xyz, where xyz is the same of a "right". If there is a readable object of the right type at the literal path "/dev/rights/xyz", then you have right xyz. There's a bit more flexibility on top of this. BPF could use /dev/rights/bpf/maptypes/lpm and /dev/rights/bpf/verifier/bounded_loops, for example. Other non-BPF use cases include a biggie: /dev/rights/namespace/create_unprivileged_userns. /dev/rights/bind_port/80 would be nice, too. I may have a bit of time over the next few days to actually prototype this. Would you be interested in using this for BPF?