On Tue, Mar 28, 2023 at 8:03 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Wed, Mar 29, 2023 at 1:15 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> >
> > On 03/28, Yafang Shao wrote:
> > > On Tue, Mar 28, 2023 at 1:28 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote:
> > > >
> > > > On 03/26, Yafang Shao wrote:
> > > > > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert
> > > > > IDs to FDs, that's intended for BPF's security model[1]. Not only
> > > > > does it prevent non-privileged users from getting other users' bpf
> > > > > programs, but it also prevents the user from iterating his own bpf
> > > > > objects.
> > > > >
> > > > > In container environment, some users want to run bpf programs in their
> > > > > containers. These users can run their bpf programs under CAP_BPF and
> > > > > some other specific CAPs, but they can't inspect their bpf programs
> > > > > in a generic way. For example, the bpftool can't be used as it
> > > > > requires CAP_SYS_ADMIN. That is very inconvenient.
> > > > >
> > > > > Without CAP_SYS_ADMIN, the only way to get the information of a bpf
> > > > > object which is not created by the process itself is with SCM_RIGHTS,
> > > > > that requires each process which created a bpf object to implement a
> > > > > unix domain socket to share the fd of a bpf object between different
> > > > > processes, that is really trivial and troublesome.
> > > > >
> > > > > Hence we need a better mechanism to get bpf object info without
> > > > > CAP_SYS_ADMIN.
> > > >
> > > > [..]
> > > >
> > > > > BPF namespace is introduced in this patchset with an attempt to remove
> > > > > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
> > > > > link in a specific bpf namespace, then these bpf objects will not be
> > > > > visible to the users in a different bpf namespace.
> > > > > But these bpf objects are visible to its parent bpf namespace, so the
> > > > > sys admin can still iterate and inspect them.
> > > >
> > > > Does it essentially mean unpriv bpf?
> > >
> > > Right. With CAP_BPF and some other CAPs enabled.
> > >
> > > > Can I, as a non-root, create
> > > > a new bpf namespace and start loading/attaching progs?
> > >
> > > No, you can't create a new bpf namespace as a non-root, see also
> > > copy_namespaces().
> > > In the container environment, new namespaces are always created by
> > > containered, which is started by root.
> >
> > Are you talking about "if (!ns_capable(user_ns, CAP_SYS_ADMIN))" part
> > from copy_namespaces? Isn't it trivially bypassed with a new user
> > namespace?
> >
> > IIUC, I can create a new user namespace which gives me CAP_SYS_ADMIN
> > in this particular user-ns. Then I can go on and create a new bpf
> > namespace (with CAP_BPF) and go wild? I won't see anything from the
> > other namespaces, but I'll be able to load/attach bpf programs?
> >
>
> I don't think so. If you create a new user namespace, and give the
> process the CAP_BPF or CAP_SYS_ADMIN in this new user namespace but not
> the initial namespace, you can't do that. Because currently only CAP_BPF
> or CAP_SYS_ADMIN in the init user namespace can load/attach bpf
> programs.
>
> > > > Maybe add a paragraph about now vs whatever you're proposing.
> > >
> > > What I'm proposing in this patchset is to put bpf objects (map, prog,
> > > link, and btf) into the bpf namespace. Next step I will put bpffs into
> > > the bpf namespace as well.
> > > That said, I'm trying to put all the objects created in bpf into the
> > > bpf namespace. Below is a simple paragraph to illustrate it.
> > >
> > > Regarding the unpriv user with CAP_BPF enabled,
> > >
> > >                                           | Now  | Future |
> > > --------------------------------------------------------------
> > > Iterate his BPF IDs                       | N    | Y      |
> > > Iterate others' BPF IDs                   | N    | N      |
> > > Convert his BPF IDs to FDs                | N    | Y      |
> > > Convert others' BPF IDs to FDs            | N    | N      |
> > > Get others' object info from pinned file  | Y(*) | N      |
> > > --------------------------------------------------------------
> > >
> > > (*) It can be improved by,
> > >     1). Different containers have different bpffs
> > >     2). Setting file permission
> > >     That's not perfect, for example, if one single user has two bpf
> > >     instances, but we don't want them to inspect each other.
> >
> > I think the question here is what happens to the existing
> > capable(CAP_BPF) checks? Do they become ns_capable(CAP_BPF) eventually?
> >
>
> They won't become ns_capable(CAP_BPF). If it becomes
> ns_capable(CAP_BPF), it will really go wild then.
>
> > And if not, I don't think it integrates well with the user namespaces?
> >
>
> IIUC, it is the CAP_BPF which doesn't integrate with the user
> namespaces, right?

Yeah. And it's probably fine if we don't, I just wanted to see some
explanation on the long-term plan. If the purpose is to have a bpf
namespace and use it for pure isolation purposes, let's state it
clearly in the cover letter. Otherwise it's not clear whether it's
only about isolation or potentially allowing CAP_BPF in user
namespaces.

Maybe clone(CLONE_NEWBPF|CLONE_NEWUSER) should return an explicit
error? (or maybe it already does, haven't looked at the patches)

One other question I have is: should init bpf namespace see
everything? Otherwise it might be hard to chase down the namespaces
that loaded some BPF program...

> --
> Regards
> Yafang
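
To make the visibility question concrete, here is a toy model of the
rule as described in the cover letter: an object is visible from the
bpf namespace that created it and from that namespace's ancestors, but
not from siblings. This is only a sketch. BpfNamespace, create_object,
visible_ids and owner_of are hypothetical names made up for
illustration, not kernel or libbpf APIs, and whether the init namespace
really sees everything is exactly the open question.

```python
# Toy model only: these names are hypothetical and do not correspond
# to any kernel or libbpf interface.

class BpfNamespace:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.own_ids = []          # IDs of objects created in this namespace
        if parent is not None:
            parent.children.append(self)

    def create_object(self, obj_id):
        """Record an object (map, prog, link, btf) as created here."""
        self.own_ids.append(obj_id)

    def visible_ids(self):
        """IDs visible from here: own objects plus all descendants' objects."""
        ids = list(self.own_ids)
        for child in self.children:
            ids.extend(child.visible_ids())
        return sorted(ids)

    def owner_of(self, obj_id):
        """Name of the visible namespace that created obj_id, else None."""
        if obj_id in self.own_ids:
            return self.name
        for child in self.children:
            owner = child.owner_of(obj_id)
            if owner is not None:
                return owner
        return None


init_ns = BpfNamespace("init")
container_a = BpfNamespace("container-a", parent=init_ns)
container_b = BpfNamespace("container-b", parent=init_ns)

init_ns.create_object(1)
container_a.create_object(2)
container_b.create_object(3)

print(container_a.visible_ids())   # only container-a's own object
print(init_ns.visible_ids())       # everything, under the assumed rule
print(init_ns.owner_of(3))         # lets the admin trace who loaded ID 3
```

Under this assumed rule the sibling containers cannot see each other,
while the init namespace can both list every ID and trace it back to
the namespace that loaded it.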