Re: [PATCH v2 bpf-next 00/18] BPF token

On Mon, Jun 26, 2023, at 3:08 PM, Andrii Nakryiko wrote:
> On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> writes:
>>
>> >> applications meets the needs of these PODs that need to do
>> >> privileged/bpf things without any tokens. Ultimately you are trusting
>> >> these apps in the same way as if you were granting a token.
>> >
>> > Yes, absolutely. As I mentioned very explicitly, it's the question of
>> > trusting application. Service vs token is implementation details, but
>> > the one that has huge implications in how applications are built,
>> > tested, versioned, deployed, etc.
>>
>> So one thing that I don't really get is why such a "trusted application"
>> needs to be run in a user namespace in the first place? If it's trusted,
>> why not simply run it as a privileged container (without the user
>> namespace) and grant it the right system-level capabilities, instead of
>> going to all this trouble just to punch a hole in the user namespace
>> isolation?
>
> Because it's still useful to provide isolation that user namespace
> provides in all other aspects besides BPF usage.
>
> The fact that it's a trusted application doesn't mean that bugs don't
> happen, or that some action that was not intended might be attempted
> (due to a bug, some deep unintended library "feature", or just because
> someone didn't anticipate some interaction).
>
> Trusted here means we believe our BPF usage is not going to spy on
> sensitive data, or attempt to disrupt other workloads, because of
> design and code reviews, and we intend to maintain that property. But
> people are still involved, of course, and bugs do happen. We'd like to
> get as much protection as possible, and that's what the user namespace
> is offering.
>

I'm wondering if your approach makes sense for Meta but maybe not outside Meta.  I think Meta is a bit unusual in that it operates a huge fleet, but the developers of the software in that fleet are a fairly tight group.   (I'm speculating here.  I don't know much about what goes on inside Meta, obviously.)

Concretely, you say "we believe our BPF usage is not going to spy on sensitive data".  Who is this "we"?  The kernel developers?  The people developing the BPF programs?  The people setting policy for the fleet?  The people creating container images that want to use BPF and run within the fleet?  Are these all the same "we"?

For a company with actual outside tenants, or a company that needs to comply with various privacy rules for some, but not all, of its applications, there are a lot of "we"s involved.  Some group develops software (or this is outsourced -- the BPF maintainership is essentially within Meta, after all).  Some group administers the fleet.  Some group develops BPF programs (or downloads them from outside and hopefully vets them).  Some group builds container images that want to use those programs.  Some group deploys these images via kubernetes or whatever.  Some group prepares reports saying that certain services offered comply with PCI or HIPAA or FedRAMP or GDPR or whatever.  They're not all the same people.

Obviously bugs exist and mistakes happen.  But, at the end of the day, someone is going to read a BPF program (or a kernel module, or whatever) and take some degree of responsibility for saying "I read this thing, and I approve its use in a certain context".  And then *that permission* should be granted.  With your patchset as it is, the permission granted is not "run this program I approved" but rather "read all kernel memory".  And I don't think that will fly with a lot of potential users.

> For BPF-side of things, we have to trust the process because there is
> no technical solution. Running outside the user namespace we also
> don't have any guarantees about BPF. We just have even less protection
> in all other aspects outside of BPF. We are trying to improve our
> story with user namespace to mitigate what's mitigatable.

But there *are* technical solutions.  At least two broad types, as I've been trying to say.

1. Stronger and more flexible controls as to which specific programs can be loaded and run.  The people doing the trusting may very well want to trust specific things (and audit which things they've trusted, etc.); a rough sketch of what this could look like follows below.

2. Stronger and more flexible controls as to what programs can do.  Right now, bpf() can attach to essentially any cgroup or tracepoint if it can attach to any at all.  Programs can access all kernel memory (because alternatives to bpf_probe_read_kernel() aren't really available, and there is no incentive right now to add them, because there isn't even a way AFAIK to turn off bpf_probe_read_kernel()).  The second sketch below shows why that matters.

Progress on either one of these could go a long way.
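
To make these concrete, here are two rough sketches.  Both are mine and purely illustrative -- nothing here is from this patchset, and the file names, digests, and addresses are made up.

For point 1, one possible shape (done in user space just to show the idea; a real mechanism would presumably live in the kernel or an LSM) is a loader that refuses to load any BPF object whose instruction bytes don't hash to a value someone has explicitly reviewed and allow-listed:

/*
 * Hypothetical loader-side allow-list check.  "prog.bpf.o" and the
 * approved digest are placeholders.
 */
#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>
#include <bpf/libbpf.h>

/* SHA-256 of the one program a reviewer has approved (placeholder). */
static const unsigned char approved[SHA256_DIGEST_LENGTH] = {
	/* filled in by whoever reviewed and approved the program */
	0,
};

int main(void)
{
	struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", NULL);
	struct bpf_program *prog;
	unsigned char digest[SHA256_DIGEST_LENGTH];

	if (!obj)
		return 1;

	bpf_object__for_each_program(prog, obj) {
		SHA256((const unsigned char *)bpf_program__insns(prog),
		       bpf_program__insn_cnt(prog) * sizeof(struct bpf_insn),
		       digest);
		if (memcmp(digest, approved, sizeof(digest))) {
			fprintf(stderr, "unapproved program, refusing to load\n");
			bpf_object__close(obj);
			return 1;
		}
	}

	/* Only reached if every program in the object matched. */
	if (bpf_object__load(obj)) {
		bpf_object__close(obj);
		return 1;
	}
	bpf_object__close(obj);
	return 0;
}

For point 2, this is what "read all kernel memory" means in practice once a program can call bpf_probe_read_kernel(); the attach point and the target address are arbitrary:

/*
 * Hypothetical kprobe program: bpf_probe_read_kernel() copies bytes from
 * whatever kernel address the program supplies.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/do_sys_openat2")
int read_anywhere(void *ctx)
{
	/* Any kernel virtual address could go here: a symbol looked up in
	 * /proc/kallsyms, a heap object, keys, another tenant's data, ... */
	const void *target = (const void *)0xffffffff81000000UL;
	char buf[64];

	bpf_probe_read_kernel(buf, sizeof(buf), target);

	/* From here the bytes can leave the kernel via a map, a ring
	 * buffer, or bpf_trace_printk(). */
	return 0;
}

As far as I know, there is no way today for whoever grants the bpf() capability to say "this workload may load tracing programs, but not ones that read arbitrary kernel addresses" -- which is exactly the gap point 2 is about.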




