On Thu, Nov 30, 2023 at 6:18 AM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Mon, Nov 27, 2023 at 11:03:54AM -0800, Andrii Nakryiko wrote: > > Add few new mount options to BPF FS that allow to specify that a given > > BPF FS instance allows creation of BPF token (added in the next patch), > > and what sort of operations are allowed under BPF token. As such, we get > > 4 new mount options, each is a bit mask > > - `delegate_cmds` allow to specify which bpf() syscall commands are > > allowed with BPF token derived from this BPF FS instance; > > - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies > > a set of allowable BPF map types that could be created with BPF token; > > - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies > > a set of allowable BPF program types that could be loaded with BPF token; > > - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies > > a set of allowable BPF program attach types that could be loaded with > > BPF token; delegate_progs and delegate_attachs are meant to be used > > together, as full BPF program type is, in general, determined > > through both program type and program attach type. > > > > Currently, these mount options accept the following forms of values: > > - a special value "any", that enables all possible values of a given > > bit set; > > - numeric value (decimal or hexadecimal, determined by kernel > > automatically) that specifies a bit mask value directly; > > - all the values for a given mount option are combined, if specified > > multiple times. E.g., `mount -t bpf nodev /path/to/mount -o > > delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3 > > mask. > > > > Ideally, more convenient (for humans) symbolic form derived from > > corresponding UAPI enums would be accepted (e.g., `-o > > delegate_progs=kprobe|tracepoint`) and I intend to implement this, but > > it requires a bunch of UAPI header churn, so I postponed it until this > > feature lands upstream or at least there is a definite consensus that > > this feature is acceptable and is going to make it, just to minimize > > amount of wasted effort and not increase amount of non-essential code to > > be reviewed. > > > > Attentive reader will notice that BPF FS is now marked as > > FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init > > user namespace as long as the process has sufficient *namespaced* > > capabilities within that user namespace. But in reality we still > > restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in > > init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added > > to allow creating BPF FS context object (i.e., fsopen("bpf")) from > > inside unprivileged process inside non-init userns, to capture that > > userns as the owning userns. It will still be required to pass this > > context object back to privileged process to instantiate and mount it. > > > > This manipulation is important, because capturing non-init userns as the > > owning userns of BPF FS instance (super block) allows to use that userns > > to constraint BPF token to that userns later on (see next patch). So > > creating BPF FS with delegation inside unprivileged userns will restrict > > derived BPF token objects to only "work" inside that intended userns, > > making it scoped to a intended "container". Also, setting these > > delegation options requires capable(CAP_SYS_ADMIN), so unprivileged > > process cannot set this up without involvement of a privileged process. > > > > There is a set of selftests at the end of the patch set that simulates > > this sequence of steps and validates that everything works as intended. > > But careful review is requested to make sure there are no missed gaps in > > the implementation and testing. > > > > This somewhat subtle set of aspects is the result of previous > > discussions ([0]) about various user namespace implications and > > interactions with BPF token functionality and is necessary to contain > > BPF token inside intended user namespace. > > > > [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/ > > > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx> > > --- > > I still think this is a little weird because this isn't really > unprivileged bpf and it isn't really safe bpf as well. > > All this does is allow an administrator to punch a big fat hole into an > unprivileged container so workloads get to play with their favorite toy. > > I think that having a way to have signed bpf programs in addition to > this would be much more interesting to generic workloads that don't know > who or what they can trust. Absolutely, that is the trust part of the puzzle. Song's LSM+fsverity patch set is a step in that direction. > > And there's a few things to remember: > > * This absolutely isn't a safety mechanism. > * This absolutely isn't safe to enable generically in containers. 100% agreed, this should be used only with trusted workloads (however the trust is established in any particular setup: signing, fsverity, whatnot) > * This is a workaround and not a solution to unprivileged bpf. > > And this is an ACK solely on the code of this patch, not the concept. > Acked-by: Christian Brauner <brauner@xxxxxxxxxx> (reluctantly) Thank you anyways! I appreciate your efforts to make sure this doesn't create unintended security holes.