On Fri, Dec 08, 2023 at 02:39:56PM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 8, 2023 at 5:41 AM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> >
> > On Thu, Nov 30, 2023 at 10:52:15AM -0800, Andrii Nakryiko wrote:
> > > Add new kind of BPF kernel object, BPF token. BPF token is meant to
> > > allow delegating privileged BPF functionality, like loading a BPF
> > > program or creating a BPF map, from privileged process to a *trusted*
> > > unprivileged process, all while having a good amount of control over which
> > > privileged operations could be performed using provided BPF token.
> > >
> > > This is achieved through mounting BPF FS instance with extra delegation
> > > mount options, which determine what operations are delegatable, and also
> > > constraining it to the owning user namespace (as mentioned in the
> > > previous patch).
> > >
> > > BPF token itself is just a derivative from BPF FS and can be created
> > > through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
> > > FS FD, which can be attained through open() API by opening BPF FS mount
> > > point. Currently, BPF token "inherits" delegated command, map types,
> > > prog type, and attach type bit sets from BPF FS as is. In the future,
> > > having a BPF token as a separate object with its own FD, we can allow
> > > to further restrict BPF token's allowable set of things either at the
> > > creation time or after the fact, allowing the process to guard itself
> > > further from unintentionally trying to load undesired kind of BPF
> > > programs. But for now we keep things simple and just copy bit sets as is.
> > >
> > > When BPF token is created from BPF FS mount, we take reference to the
> > > BPF super block's owning user namespace, and then use that namespace for
> > > checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
> > > capabilities that are normally only checked against init userns (using
> > > capable()), but now we check them using ns_capable() instead (if BPF
> > > token is provided). See bpf_token_capable() for details.
> > >
> > > Such setup means that BPF token in itself is not sufficient to grant BPF
> > > functionality. User namespaced process has to *also* have necessary
> > > combination of capabilities inside that user namespace. So while
> > > previously CAP_BPF was useless when granted within user namespace, now
> > > it gains a meaning and allows container managers and sys admins to have
> > > a flexible control over which processes can and need to use BPF
> > > functionality within the user namespace (i.e., container in practice).
> > > And BPF FS delegation mount options and derived BPF tokens serve as
> > > a per-container "flag" to grant overall ability to use bpf() (plus further
> > > restrict on which parts of bpf() syscalls are treated as namespaced).
> > >
> > > Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
> > > within the BPF FS owning user namespace, rounding up the ns_capable()
> > > story of BPF token.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@xxxxxxxxxx>
> > > ---
> >
> > Same concerns as in the other mail. For the bpf_token_create() code,
> > Acked-by: Christian Brauner <brauner@xxxxxxxxxx>
>
> This patch set has landed in bpf-next and there are a bunch of other
> patches after it, so I presume it will be a bit problematic to add ack
> after the fact. But thanks for taking another look and acking!

Yeah, I don't mind. :)