On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@xxxxxxxxxx> wrote:
>
> Please avoid replying in HTML.
>
> On 22/06/2023 17:49, Andy Lutomirski wrote:
>
> Apologies for being blunt, but the token approach to me seems to be a workaround for providing the right level/classification for a pod/container in order to say you support unprivileged containers using eBPF. I think if your container needs to do privileged things, it should be classified with (and granted) the right permissions (privileges) to do what it needs to do.
>
> Bluntness is great.
>
> I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is.
>
> "the token approach to me seems like a workaround for providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>
> That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay.
>
> "the token approach to me seems like a workaround for providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>
> The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>
> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>
> "the token approach to me seems like a workaround for providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>
> My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990s-style ownership and DAC rules to them. I even *wrote the code*. But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>
> Please try harder.
>
> I'm going to be honest, I can't tell if we are in agreement or not :). I'm also going to use pod and container interchangeably throughout my response (bear with me).
>
> So just to clarify a few things on my end: when I said "level/classification" I meant privileges --> a container should have the right level of privileges assigned to it for what it's trying to do, and in the K8s scenario that happens through its pod spec. To me it seems like BPF token is a way to work around the permissions assigned to a container in K8s. For example: with bpf_token I'm marking a pod as unprivileged, but then, under the hood and through another service, I'm giving it a token to do more than what was specified in its pod spec. Yeah, I have a separate service controlling the tokens, but something about it just seems not right (to me). If CAP_BPF is too broad, can we break it down further into something more granular? Something that can be assigned to the container through the pod spec rather than a separate service that seems to be doing things under the hood?
>
> This doesn't even start to solve the problem, I know...

Disclaimer: I don't know anything about Kubernetes, so don't expect me to reply with correct terminology or a detailed understanding of how containers are configured. But on a more generic and conceptual level, it seems like you are making some implementation assumptions and arguing based on those.

Like, why can't the container spec have native support for "granted BPF functionality"? Why would a BPF token have to be granted through some separate service, instead of being integrated into whatever Kubernetes "container manager" functionality there is, as just a natural extension of the spec?
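To make that concrete, here is roughly what the grant looks like on the container manager side under this proposal. This is only a sketch with a hypothetical helper name: it assumes the manager already mounted a dedicated bpffs instance for the container with delegate_cmds/delegate_maps/delegate_progs/delegate_attachs options encoding what the admin decided to trust it with, and the BPF_TOKEN_CREATE command and attr layout follow the current patches, so they may still change:

/* Container-manager side of BPF token delegation (sketch).
 * bpffs_fd refers to a bpffs instance set up for this container
 * with delegate_* mount options chosen by the admin.
 */
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper; returns a token FD to hand to the container
 * (e.g., over SCM_RIGHTS, or inherited across exec).
 */
static int make_bpf_token(int bpffs_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.token_create.bpffs_fd = bpffs_fd;

	return syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
}

The important part is that whoever sets up the container (and its mounts) is the one deciding what to delegate; nothing here requires an extra runtime service.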
As for CAP_BPF being too broad: it is broad, yes. If you have good ideas on how to break it down some more -- please propose them. But this is all orthogonal, because the blocking problem is the fundamental incompatibility between user namespaces (and their implied isolation and sandboxing of workloads) and BPF functionality, which is global by its very nature. The latter is unavoidable in principle: no matter how much you break down CAP_BPF, you can't enforce that a BPF program won't interfere with applications in other containers, or that it won't "spy" on them. That's just not something BPF can enforce in principle.

So this comes back down to a question of trust and then controlled delegation of BPF functionality. You trust a workload with BPF usage because you've reviewed the BPF code, the workload, its testing, etc.? Grant it a BPF token and let that container use a limited subset of BPF. Employ BPF LSM to restrict it further, beyond what the BPF token can control. You cannot trust an application not to do something harmful? Then you shouldn't grant it either CAP_BPF in the init namespace or a BPF token in a user namespace. That's it. Pick your poison.

But all of this cannot be mechanically decided or enforced. There have to be humans involved in making these decisions. The kernel's job is to provide the building blocks to grant and control BPF functionality to the extent that is technically possible.

> I understand the difficulties with trying to deploy BPF in K8s and the concerns around privilege escalation for the containers. I understand not all use cases are created equal, but I think this falls into at least 2 categories:
>
> - Pods/containers that need to do privileged BPF ops but not under a CAP_BPF umbrella --> sure, we need something for this.
> - Pods/containers that don't need to do any privileged BPF ops but still use BPF --> these are happy with a proxy service loading/unloading the BPF progs, creating maps and pinning them... But even in this scenario we need something to isolate the pinned maps/progs by different apps (why not DAC rules?), even better if the maps are in the container...

The above doesn't make much sense to me, sorry. If the application is OK using unprivileged BPF, there is no problem: it can do that today already, with no BPF proxy or BPF token involved.

As for "something to isolate the pinned maps/progs by different apps (why not DAC rules?)", there is no such thing, as I've explained already. I can install a sched_switch raw_tracepoint BPF program (if I'm allowed to), and that program has system-wide observability. It cannot be bound to an application. You can't just say "trigger this sched_switch program only for scheduler decisions within my container". When you actually start thinking through just that one example, even assuming we added some per-container filter in the kernel so your program isn't triggered, what do we do when we switch from process A in container X to process B in container Y? Does that event belong to container X or to container Y? How can you prevent the program from reading data of the task that doesn't belong to your container, when both tasks are inputs to this single tracepoint event? Hopefully you can see where I'm going with this. And this is just one random tiny example; we can think up tons of other cases to prove that BPF is not isolatable to any sort of "container".
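To put actual code behind that example, a minimal version of such a program looks something like this (a sketch assuming vmlinux.h and libbpf's CO-RE helpers; the hook name and helpers are standard, the rest is illustrative):

/* Minimal sched_switch raw_tracepoint program. Nothing here is, or
 * can be, scoped to a container: it fires for every context switch
 * on the host and sees both the outgoing and the incoming task,
 * whichever containers they belong to.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

SEC("raw_tp/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	/* prev might be container X's task and next container Y's:
	 * which container does this one event "belong" to?
	 */
	bpf_printk("switch %d -> %d",
		   BPF_CORE_READ(prev, pid), BPF_CORE_READ(next, pid));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";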
> Anyway - I hope this clarifies my original intent - which is that a proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems, the more we can stick to tried and trusted mechanisms, the better.

I disagree. A BPF proxy complicates logistics, operations, and the developer experience, without resolving the underlying issues: someone still has to determine trust, and BPF functionality still has to be delegated (or proxied) one way or another.
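And just to contrast the developer experience with the proxy model: with a token, the workload keeps making the same bpf() syscalls it makes today, passing one extra FD, instead of being rewritten around RPC to an external daemon. Again a sketch with a hypothetical helper, against the current proposal (map_token_fd and BPF_F_TOKEN_FD may still change):

/* Container side: creating a map with a granted token FD (sketch). */
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int create_map_with_token(int token_fd)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = sizeof(int);
	attr.value_size = sizeof(long);
	attr.max_entries = 1024;
	attr.map_flags = BPF_F_TOKEN_FD; /* opt in to token-based checks */
	attr.map_token_fd = token_fd;    /* FD granted by the manager */

	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}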