On Thu, Jun 22, 2023 at 8:29 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
> > On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >>
> >>
> >>
> >> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> >> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >> >>
> >> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >> >>>> BPF is still a privileged thing. You can't just say that any
> >> >>>> unprivileged application should be able to use BPF. That's why BPF
> >> >>>> token is about trusting unpriv application in a controlled environment
> >> >>>> (production) to not do something crazy. It can be enforced further
> >> >>>> through LSM usage, but in a lot of cases, when dealing with internal
> >> >>>> production applications it's enough to have a proper application
> >> >>>> design and rely on code review process to avoid any negative effects.
> >> >>> We really shouldn't be creating new kinds of privileged containers that do uncontained things.
> >> >>>
> >> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> >> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> >> >> experiences with building and using such BPF proxy. We (Meta)
> >> >> internally didn't go this route at all and strongly prefer not to.
> >> >> There are lots of downsides and complications to having a BPF proxy.
> >> >> In the end, this is just shuffling around where the decision about
> >> >> trusting a given application with BPF access is being made. BPF proxy
> >> >> adds lots of unnecessary logistical, operational, and development
> >> >> complexity, but doesn't magically make anything safer.
> >> >>
> >> >> [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@xxxxxxxxxxxxxx/
> >> >>
> >> > Apologies for being blunt, but the token approach to me seems to be a
> >> > work around providing the right level/classification for a pod/container
> >> > in order to say you support unprivileged containers using eBPF. I think
> >> > if your container needs to do privileged things it should have and be
> >> > classified with the right permissions (privileges) to do what it needs
> >> > to do.
> >>
> >> Bluntness is great.
> >>
> >> I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is.
> >
> > BPF is not "anything else", it's important to understand that BPF is
> > inherently not compartmentalizable. And it's vast and generic in its
> > capabilities. This changes everything. So your analogies are
> > misleading.
> >
>
> file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc. They are infinitely extensible. They work in containers.
>
> What is so special about BPF?

A socket has a well-defined and constrained interface that defines what
you can do with it (send and receive bytes, in a controlled fashion).
BPF programs, by contrast, are intentionally allowed to have almost
arbitrarily complex control flow *controlled by the user*, can combine
dozens if not hundreds of "building blocks" (BPF helpers, kfuncs,
various BPF maps, etc.), and can be activated at various points deep in
the kernel (running that custom user-provided code in kernel space).
I'd say that yeah, BPF is on another level as far as genericity goes,
compared to other interfaces. And that's BPF's goal and appeal, nothing
wrong with it. But I do think BPF and sockets, files, things in /proc,
etc. are pretty different in terms of how they can be proven and
enforced to be sandboxed.
>
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
> >>
> >> That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay.
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
> >>
> >> The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
> >>
> >> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
> >>
> >> My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
> >
> > Can you apply DAC rules to which kernel events BPF program can be run
> > on? Can you apply DAC rules to which in-kernel data structures a BPF
> > program can look at and make sure that it doesn't access a
> > task/socket/etc that "belongs" to some other container/user/etc?
>
> No, of course.
>
> If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module. It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.
>
> We don't give containers special tokens that let them load arbitrary modules. We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.
>
> But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers. BPF can learn to do this.
>
> >
> > Can we limit XDP or AF_XDP BPF programs from seeing and controlling
> > network traffic that will be eventually routed to a container that XDP
> > program "should not" have access to? Without making everything so slow
> > that it's useless?
>
> Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that. Or a vlan or a macvlan or whatever. (I'm assuming XDP can be scoped like this. I'm not that familiar with the details.)
>
> >
> >> I even *wrote the code*.
> >
> > Did you submit it upstream for review and wide discussion?
>
> Yes.
>
> > Did you
> > test it and integrate it with production workloads to prove that your
> > solution is actually a viable real-world solution and not a toy?
>
> I did test it. I did not integrate it with production workloads.
>

Real-world use cases are the ultimate test of APIs and features. No
matter how brilliant and elegant the solution is, if it doesn't work
with real-world applications, it's pretty useless. It's not that hard
to allow only a very limited and very restrictive subset of BPF to be
loaded and attached from containers without privileged permissions.
But the point is to find a solution that works for complicated (and
sometimes very messy) real applications that were validated by humans
(to the best of their abilities), but can't be proven to be contained
within some container.

> > Writing the code doesn't mean solving the problem.
>
> Of course not. My code was a little step in the right direction. The BPF community was apparently not interested in it.
>
> >
> >> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
> >
> > I won't speak on behalf of the entire BPF community, but I'm trying to
> > explain that BPF cannot be reasonably sandboxed and has to be
> > privileged due to its global nature. And I haven't yet seen any
> > realistic counter-proposal to change that. And it's not about
> > ownership of the BPF map or BPF program, it's way beyond that..
> >
>
> It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created.
>
> If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.
>
> I'm saying that I think there *can* be a security model. But until the maintainers start to believe that, there won't be one.

See above: whatever security model you have in mind should be workable
with real-world applications. Building some elegant system that will
work for just a (rather small) subset of use cases isn't appealing.