On Fri, Aug 23, 2019 at 04:09:11PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 22, 2019 at 4:26 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> > You're proposing all of the above in addition to CAP_BPF, right?
> > Otherwise I don't see how it addresses the use cases I kept
> > explaining for the last few weeks.
>
> None of my proposal is intended to exclude changes like CAP_BPF to
> make privileged bpf() operations need less privilege. But I think
> it's very hard to evaluate CAP_BPF without both a full description of
> exactly what CAP_BPF would do and what at least one full example of a
> user would look like.

the example is in the previous email, and the systemd example was not "full"?

> I also think that users who want CAP_BPF should look at manipulating
> their effective capability set instead. A daemon that wants to use
> bpf() but otherwise minimize the chance of accidentally causing a
> problem can use capset() to clear its effective and inheritable masks.
> Then, each time it wants to call bpf(), it could re-add CAP_SYS_ADMIN
> or CAP_NET_ADMIN to its effective set, call bpf(), and then clear its
> effective set again. This works in current kernels and is generally
> good practice.

Such logic means that CAP_NET_ADMIN is not necessary either.
The process could re-add CAP_SYS_ADMIN when it needs to reconfigure
network and then drop it.

> Aside from this, and depending on exactly what CAP_BPF would be, I
> have some further concerns. Looking at your example in this email:
>
> > Here is another example of use case that CAP_BPF is solving:
> > The daemon X is started by pid=1 and currently runs as root.
> > It loads a bunch of tracing progs and attaches them to kprobes
> > and tracepoints. It also loads cgroup-bpf progs and attaches them
> > to cgroups. All progs are collecting data about the system and
> > logging it for further analysis.
>
> This needs more than just bpf().
> Creating a perf kprobe event
> requires CAP_SYS_ADMIN, and without a perf kprobe event, you can't
> attach a bpf program.

that is already solved by sysctl_perf_event_paranoid.
CAP_BPF is about BPF part only.

> And the privilege to attach bpf programs to
> cgroups without any DAC or MAC checks (which is what the current API
> does) is an extremely broad privilege that is not that much weaker
> than CAP_SYS_ADMIN or CAP_NET_ADMIN. Also:

I don't think there is a hierarchy of CAP_SYS_ADMIN vs CAP_NET_ADMIN
vs CAP_BPF. CAP_BPF and CAP_NET_ADMIN carve different areas of
CAP_SYS_ADMIN. Just like all other caps.

> > This tracing bpf is looking into kernel memory
> > and using bpf_probe_read. Clearly it's not _secure_. But it's _safe_.
> > The system is not going to crash because of BPF,
> > but it can easily crash because of simple coding bugs in the user
> > space bits of that daemon.
>
> The BPF verifier and interpreter, taken in isolation, may be extremely
> safe, but attaching BPF programs to various hooks can easily take down
> the system, deliberately or by accident. A handler, especially if it
> can access user memory or otherwise fault, will explode if attached to
> an inappropriate kprobe, hw_breakpoint, or function entry trace event.

absolutely not true.

> (I and the other maintainers consider this to be a bug if it happens,
> and we'll fix it, but these bugs definitely exist.) A cgroup-bpf hook
> that blocks all network traffic will effectively kill a machine,
> especially if it's a server.

this permission is granted by CAP_NET_ADMIN. Nothing changes here.

> A bpf program that runs excessively
> slowly attached to a high-frequency hook will kill the system, too.

not true either.

> (I bet a buggy bpf program that calls bpf_probe_read() on an unmapped
> address repeatedly could be made extremely slow. Page faults take
> thousands to tens of thousands of cycles.)

kprobe probing and faulting on non-existent address will do the same 'damage'.
So it's not bpf related.
Also it won't make the system "extremely slow".
Nothing to do with CAP_BPF.

> A bpf firewall rule that's
> wrong can cut a machine off from the network -- I've killed machines
> using iptables more than once, and bpf isn't magically safer.

this is CAP_NET_ADMIN permission. It's a different capability.

>
> I'm wondering if something like CAP_TRACING would make sense.
> CAP_TRACING would allow operations that can reveal kernel memory and
> other secret kernel state but that do not, by design, allow modifying
> system behavior. So, for example, CAP_TRACING would allow privileged
> perf_event_open() operations and privileged bpf verifier usage. But
> it would not allow cgroup-bpf unless further restrictions were added,
> and it would not allow the *_BY_ID operations, as those can modify
> other users' bpf programs' behavior.

Makes little sense to me. I can imagine CAP_TRACING controlling
kprobe/uprobe creation and probe_read() both from bpf side and
from vanilla kprobe. That would be a much nicer interface to use
than existing sysctl_perf_event_paranoid, but that is orthogonal
to CAP_BPF which is strictly about BPF.

> Something finer-grained can mitigate some of this. CAP_BPF as I think
> you're imagining it will not.

I'm afraid this discussion goes nowhere.
We'll post CAP_BPF patches soon so we can discuss code.
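
For readers following the capset() suggestion quoted earlier in this
message (clear the effective and inheritable masks, re-add one
capability just before bpf(), then clear it again), here is a rough
sketch of that pattern. It is not code from either author. The caps_*
helper names are hypothetical, and it assumes Linux with the
<linux/capability.h> UAPI header and the raw capget()/capset()
syscalls rather than libcap.

```c
/* Sketch only: the caps_* helper names are hypothetical. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fetch the calling thread's capability sets with the raw capget() syscall. */
static int caps_get(struct __user_cap_header_struct *hdr,
                    struct __user_cap_data_struct *data)
{
    hdr->version = _LINUX_CAPABILITY_VERSION_3;
    hdr->pid = 0; /* 0 selects the calling thread */
    return (int)syscall(SYS_capget, hdr, data);
}

/* Clear effective and inheritable; the permitted set is left alone so
 * that a capability can be raised again later. */
static int caps_clear_effective_and_inheritable(void)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
    int i;

    if (caps_get(&hdr, data) != 0)
        return -1;
    for (i = 0; i < _LINUX_CAPABILITY_U32S_3; i++) {
        data[i].effective = 0;
        data[i].inheritable = 0;
    }
    return (int)syscall(SYS_capset, &hdr, data);
}

/* Raise (on != 0) or drop (on == 0) one capability in the effective set.
 * Raising fails unless the capability is in the permitted set. */
static int caps_set_effective(int cap, int on)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (caps_get(&hdr, data) != 0)
        return -1;
    if (on)
        data[CAP_TO_INDEX(cap)].effective |= CAP_TO_MASK(cap);
    else
        data[CAP_TO_INDEX(cap)].effective &= ~CAP_TO_MASK(cap);
    return (int)syscall(SYS_capset, &hdr, data);
}

/* Return 1 if cap is in the effective set, 0 if not, -1 on error. */
static int caps_effective_has(int cap)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    if (caps_get(&hdr, data) != 0)
        return -1;
    return (data[CAP_TO_INDEX(cap)].effective & CAP_TO_MASK(cap)) != 0;
}
```

A daemon would call caps_clear_effective_and_inheritable() once at
startup, then bracket each bpf() call with
caps_set_effective(CAP_SYS_ADMIN, 1) and
caps_set_effective(CAP_SYS_ADMIN, 0). The raise only succeeds while
CAP_SYS_ADMIN remains in the permitted set, which is why the sketch
never touches that set.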