On Wed, Nov 8, 2023 at 5:51 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Fri, Nov 03, 2023 at 12:05:08PM -0700, Andrii Nakryiko wrote: > > Add few new mount options to BPF FS that allow to specify that a given > > BPF FS instance allows creation of BPF token (added in the next patch), > > and what sort of operations are allowed under BPF token. As such, we get > > 4 new mount options, each is a bit mask > > - `delegate_cmds` allow to specify which bpf() syscall commands are > > allowed with BPF token derived from this BPF FS instance; > > - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies > > a set of allowable BPF map types that could be created with BPF token; > > - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies > > a set of allowable BPF program types that could be loaded with BPF token; > > - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies > > a set of allowable BPF program attach types that could be loaded with > > BPF token; delegate_progs and delegate_attachs are meant to be used > > together, as full BPF program type is, in general, determined > > through both program type and program attach type. > > > > Currently, these mount options accept the following forms of values: > > - a special value "any", that enables all possible values of a given > > bit set; > > - numeric value (decimal or hexadecimal, determined by kernel > > automatically) that specifies a bit mask value directly; > > - all the values for a given mount option are combined, if specified > > multiple times. E.g., `mount -t bpf nodev /path/to/mount -o > > delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3 > > mask. > > > > Ideally, more convenient (for humans) symbolic form derived from > > corresponding UAPI enums would be accepted (e.g., `-o > > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but > > it requires a bunch of UAPI header churn, so I postponed it until this > > feature lands upstream or at least there is a definite consensus that > > this feature is acceptable and is going to make it, just to minimize > > amount of wasted effort and not increase amount of non-essential code to > > be reviewed. > > > > Attentive reader will notice that BPF FS is now marked as > > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init > > user namespace as long as the process has sufficient *namespaced* > > capabilities within that user namespace. But in reality we still > > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in > > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added > > to allow creating BPF FS context object (i.e., fsopen("bpf")) from > > inside unprivileged process inside non-init userns, to capture that > > userns as the owning userns. It will still be required to pass this > > context object back to privileged process to instantiate and mount it. > > > > This manipulation is important, because capturing non-init userns as the > > owning userns of BPF FS instance (super block) allows to use that userns > > to constraint BPF token to that userns later on (see next patch). So > > creating BPF FS with delegation inside unprivileged userns will restrict > > derived BPF token objects to only "work" inside that intended userns, > > making it scoped to a intended "container". > > > > There is a set of selftests at the end of the patch set that simulates > > this sequence of steps and validates that everything works as intended. > > But careful review is requested to make sure there are no missed gaps in > > the implementation and testing. > > > > All this is based on suggestions and discussions with Christian Brauner > > ([0]), to the best of my ability to follow all the implications. > > "who will not be held responsible for any CVE future or present as he's > not sure whether bpf token is a good idea in general" > > I'm not opposing it because it's really not my subsystem. But it'd be > nice if you also added a disclaimer that I'm not endorsing this. :) > Sure, I'll clarify. I still appreciate your reviewing everything and pointing out all the gotchas (like the reconfiguration and other stuff), thanks! > A comment below. > > > > > [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/ > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > --- > > include/linux/bpf.h | 10 ++++++ > > kernel/bpf/inode.c | 88 +++++++++++++++++++++++++++++++++++++++------ > > 2 files changed, 88 insertions(+), 10 deletions(-) > > [...] > > opt = fs_parse(fc, bpf_fs_parameters, param, &result); > > if (opt < 0) { > > @@ -665,6 +692,25 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param) > > case OPT_MODE: > > opts->mode = result.uint_32 & S_IALLUGO; > > break; > > + case OPT_DELEGATE_CMDS: > > + case OPT_DELEGATE_MAPS: > > + case OPT_DELEGATE_PROGS: > > + case OPT_DELEGATE_ATTACHS: > > + if (strcmp(param->string, "any") == 0) { > > + msk = ~0ULL; > > + } else { > > + err = kstrtou64(param->string, 0, &msk); > > + if (err) > > + return err; > > + } > > + switch (opt) { > > + case OPT_DELEGATE_CMDS: opts->delegate_cmds |= msk; break; > > + case OPT_DELEGATE_MAPS: opts->delegate_maps |= msk; break; > > + case OPT_DELEGATE_PROGS: opts->delegate_progs |= msk; break; > > + case OPT_DELEGATE_ATTACHS: opts->delegate_attachs |= msk; break; > > + default: return -EINVAL; > > + } > > + break; > > } > > So just to repeat that this will allow a container to set it's own > delegation options: > > # unprivileged container > > fd_fs = fsopen(); > fsconfig(fd_fs, FSCONFIG_BLA_BLA, "give-me-all-the-delegation"); > > # Now hand of that fd_fs to a privileged process > > fsconfig(fd_fs, FSCONFIG_CREATE_CMD, ...) > > This means the container manager can't be part of your threat model > because you need to trust it to set delegation options. > > But if the container manager is part of your threat model then you can > never trust an fd_fs handed to you because the container manager might > have enabled arbitrary delegation privileges. > > There's ways around this: > > (1) kernel: Account for this in the kernel and require privileges when > setting delegation options. What sort of privilege would that be? We are in an unprivileged user namespace, so that would have to be some ns_capable() checks or something? I can add ns_capable(CAP_BPF), but what else did you have in mind? I think even if we say that privileged parent does FSCONFIG_SET_STRING and unprivileged child just does sys_fsopen("bpf", 0) and nothing more, we still can't be sure that child won't race with parent and set FSCONFIG_SET_STRING at the same time. Because they both have access to the same fs_fd. > (2) userspace: A trusted helper that allocates an fs_context fd in > the target user namespace, then sets delegation options and creates > superblock. > > (1) Is more restrictive but also more secure. (2) is less restrictive > but requires more care from userspace. > > Either way I would probably consider writing a document detailing > various delegation scenarios and possible pitfalls and implications > before advertising it. > > If you choose (2) then you also need to be aware that the security of > this also hinges on bpffs not allowing to reconfigure parameters once it > has been mounted. Otherwise an unprivileged container can change > delegation options. > > I would recommend that you either add a dummy bpf_reconfigure() method > with a comment in it or you add a comment on top of bpf_context_ops. > Something like: > > /* > * Unprivileged mounts of bpffs are owned by the user namespace they are > * mounted in. That means unprivileged users can change vfs mount > * options (ro<->rw, nosuid, etc.). > * > * They currently cannot change bpffs specific mount options such as > * delegation settings. If that is ever implemented it is necessary to > * require rivileges in the initial namespace. Otherwise unprivileged > * users can change delegation options to whatever they want. > */ Yep, I will add a custom callback. I think we can allow reconfiguring towards less permissive delegation subset, but I'll need to look at the specifics to see if we can support that easily.