On Mon, Aug 5, 2019 at 12:21 PM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Mon, Aug 05, 2019 at 10:23:10AM -0700, Andy Lutomirski wrote: > > > > I refreshed the branch again. I had a giant hole in my previous idea > > that we could deprivilege program loading: some BPF functions need > > privilege. Now I have a changelog comment to that effect and a patch > > that sketches out a way to addressing this. > > > > I don't think I'm going to have time soon to actually get any of this > > stuff mergeable, and it would be fantastic if you or someone else who > > likes working of bpf were to take this code and run with it. Feel > > free to add my Signed-off-by, and I'd be happy to help review. > > Thanks a lot for working on patches and helping us with the design! > > Can you resend the patches to the mailing list? > It's kinda hard to reply/review to patches that are somewhere in the web. Will do. > I'm still trying to understand the main idea. > If I'm reading things correctly: The series doesn't, strictly speaking, have an overall problem that it solves. It's a series of steps in the direction of making bpf() make more sense without privilege and toward reducing the required privilege. > patch 1 "add access permissions to bpf fds" > just passes the flags ? It tries to make the kernel respect the access modes for fds. Without this patch, there seem to be some holes: nothing looked at program fds and, unless I missed something, you could take a readonly fd for a program, pin the program, and reopen it RW. > patch 2 "Don't require mknod() permission to pin an object" > makes sense in isolation. It makes even more sense now :) > patch 3 "Allow creating all program types without privilege" > is not right. I think it can be made right, which is the point. > patch 4 "Add a way to mark functions as requiring privilege" > is an interesting idea, but I don't think it helps that much. Other than the issue that this patch partially fixes, can you see any reason that loading a program should require privilege? Obviously the verifier is weakened a bit when called by privileged users, but a lot of that is about excessive resource usage and various less-well-tested features. It seems to me that most of the value of bpf() should be available to programs that should not need privilege to load. Are there things I'm missing? > > So the main thing we're trying to solve with augmented bpf syscall > and/or /dev/bpf is to be able to use root-only features of bpf when > trused process already dropped root permissions. > These features include bpf2bpf calls, bounded loops, special maps (like LPM), etc. Can you elaborate on all these: I see nothing inherently wrong with bpf2bpf for unprivileged users as long as they have appropriate access to the called program. Patch 1 improves that. Bounded loops: if they are adequately well verified, then the only damage is that they can make bpf progs that run slowly, right? It seems like some kind of capability or sysctl for "allow using lots of bpf resources" would do the trick. This could even be a cgroup setting -- bpf resources aren't all that different from any other resource. LPM: I don't see why this requires privilege at all. It indeed checks capable(CAP_SYS_ADMIN), but I don't see why. > > Attaching to a cgroup already has file based permission checks. > The user needs to open cgroup directory to attach. > acls on cgroup dir can already be used to prevent attaching to > certain parts of cgroup hierarchy. The current checks seem inadequate. $ echo 'yay' </sys/fs/cgroup/systemd/system.slice/ The ability to obtain an fd to a cgroup does *not* imply any right to modify that cgroup. The ability to write to a cgroup directory already means something else -- it's the ability to create cgroups under the group in question. I'm suggesting that a new API be added that allows attaching a bpf program to a cgroup without capabilities and that instead requires write access to a new file in the cgroup directory. (It could be a single file for all bpf types or one file per type. I prefer the latter -- it gives the admin finer-grained control.) > What we need is to drop privileges sooner in daemons like systemd. This is doable right now: systemd could fork off a subprocess and delegate its cgroup operations to it. It would be maybe a couple hundred lines of code. As an added benefit, that subprocess could verify that the bpf operations in question are reasonable. Alternatively, if there was a CAP_BPF_ADMIN, systemd could retain that capability and flip it on and off as needed. > Container management daemon runs in the nested containers. > These trusted daemons need to have access to full bpf, but they > don't want to be root all the time. > They cannot flip back and forth via seteuid to root every time they > need to do bpf. > Hence the idea is to have a file that this daemon can open, > then drop privileges and still keep doing bpf things because FD is held. > Outer container daemon can pass this /dev/bpf's FD to inner daemon, etc. > This /dev/bpf would be accessible to root only. > There is no desire to open it up to non-root. This seems extremely dangerous right now. A program that can bypass *all* of the capable() checks in bpf() can do a whole lot. Among other things, it can read all of kernel memory. It can very likely gain full system root by appropriate installation of malicious programs in a cgroup that contains fully privileged programs. In this regard, bpf() is like most of the Linux capabilities -- it seems somewhat limited, but it really implies a lot of privilege. There was a little paper awhile back pointing out that, on a normal system, most of the Linux capabilities were functionally equivalent. > > It seems there is concern that /dev/bpf is unnecessary special. > How about we combine bpffs and /dev/bpf ideas? > Like we can have a special file name in bpffs. > The root would do 'touch /sys/fs/bpf/privileges' and it would behave > just like /dev/bpf, but now it can be in any bpffs directory and acls > to bpffs mount would work as-is. This seems to have most of the same problems. My main point is that it conflates a whole lot of different permissions, and I really don't think it's that much work to mostly disentangle the permissions in question. My little series (if completed) plus a patch to allow unprivileged cgroup attach operations if you have an FMODE_WRITE fd to an appropriate file should get most of the way there. Also, be careful about your bpffs idea: bpffs is (sort of) namespaced, and it would make sense to allow new bpf instances to be created inside unprivileged user namespaces. Such instances should not be able to create magical privilege-granting files. In that respect, /dev/bpf is better. > > CAP_BPF is also good idea. I think for the enviroment where untrusted > and unprivileged users want to run 'bpftrace' that would be perfect mechanism. > getcap /bin/bpftrace would have cap_bpf, cap_kprobe and whatever else. > Sort of like /bin/ping. > But I don't see how cap_bpf helps to solve our trusted root daemon problem. > imo open ("/sys/fs/bpf/privileges") and pass that FD into bpf syscall > is the only viable mechanism. > As above, I think that forking before dropping privileges and asking the child to do the bpf() operations is safer and more flexible. > Note the verifier does very different amount of work for unpriv vs root. > It does speculative execution analysis, pointer leak checks for unpriv. > So we gotta pass special flag to the verifier to make it act like it's > loading a program for root. > Indeed. And programs in untrusted containers should not be able to do this.