> From: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > Sent: Monday, March 10, 2025 9:59 PM > > Parav Pandit <parav@xxxxxxxxxx> writes: > > > A process running in a non-init user namespace possesses the > > CAP_NET_RAW capability. However, the patch cited in the fixes tag > > checks the capability in the default init user namespace. > > Because of this, when the process was started by Podman in a > > non-default user namespace, the flow creation failed. > > This change isn't a bug fix. This change is a relaxation of permissions and it > would be very good if this change description described why it is in fact safe. As you explained below, it is not safe enough. :) I will improve the change description to reflect as I follow your good suggestions below. > > Many parts of the kernel are not safe for arbitrary users > to use. In those cases an ordinary capable like you found > is used. > Understood now. > > Fix this issue by checking the CAP_NET_RAW networking capability in > > the owner user namespace that created the network namespace. > > > > This change is similar to the following cited patches. > > > > commit 5e1fccc0bfac ("net: Allow userns root control of the core of > > the network stack.") commit 52e804c6dfaa ("net: Allow userns root to > > control ipv4") commit 59cd7377660a ("net: openvswitch: allow conntrack > > in non-initial user namespace") commit 0a3deb11858a ("fs: Allow > > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: > > relax permissions for listmount()") > > It is different in that hardware is involved. There is a fair amount of kernel > bypass allowed by design in infiniband so this may indeed be safe to allow > any user on the system to do. Still for someone who isn't intimate with > infiniband this isn't clear. > > > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > > Signed-off-by: Parav Pandit <parav@xxxxxxxxxx> > > > > --- > > I would like to have feedback from the LSM experts to make sure this > > fix is correct. Given the widespread usage of the capable() call, it > > makes me wonder if the patch right. > > > > Secondly, I wasn't able to determine which primary namespace (such as > > mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > > (not directly related to this patch, but as concept) > > I took a quick look and it appears that no one figures any of the > CAP_IPC_LOCK capability checks are safe for anyone except the global root > user. > > Allowing an arbitrary user to lock all of memory seems to defeat all of the > safeguards that are in place to limiting memory locking. > > It looks like RLIMIT_MEMLOCK has been updated to be per user namespace > (with hierachical limits), so I expect the most reasonable thing to do is to > simply ensure the process that creates the user namespace has a large > enough RLIMIT_MEMLOCK when the user namespace is created. Ok, but if infiniband code does capable(), it is going to check the limit outside of the user namespace, and the call will still fails. Isn't it? May be the users in non init user ns must run their infiniband application without pinning the memory. Aka ODP in infiniband world. > > > --- > > drivers/infiniband/core/uverbs_cmd.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/infiniband/core/uverbs_cmd.c > > b/drivers/infiniband/core/uverbs_cmd.c > > index 5ad14c39d48c..8d6615f390f5 100644 > > --- a/drivers/infiniband/core/uverbs_cmd.c > > +++ b/drivers/infiniband/core/uverbs_cmd.c > > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct > uverbs_attr_bundle *attrs) > > if (cmd.comp_mask) > > return -EINVAL; > > > > - if (!capable(CAP_NET_RAW)) > > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > > return -EPERM; > > Looking at the code in drivers/infiniband/core/uverbs_cmd.c > I don't think original capable call is actually correct. > > The problem is that infiniband runs commands through a file descriptor. > Which means that anyone who can open the file descriptor and then obtain a > program that will work like a suid cat can bypass the current permission > check. > > Before we relax any checks that test needs to be: > file_ns_capable(file, &init_user_ns, CAP_NET_RAW); > > Similarly the network namespace you are talking about in those infiniband > commands really needs to be derived from the file descriptor instead of > current. > This now start making sense to me. When the file descriptor is open, I need to record the net ns and use it for rest of the life cycle of the process (even if unshare(CLONE_NEWNET) is called) after opening the file. Something like how sk_alloc() does sock_net_set(sk, net); Do I understand you correctly? > Those kinds of bugs seem very easy to find in the infiniband code so I have a > hunch that the infiniband code needs some tender loving care before it is safe > for unprivileged users to be able to do anything with it. > Well, started to improve now... > In particular there was a whole lot of bug fixes and other work done to the > mount namespace and in the networking stack before allowing unprivileged > users to use it. > > In the ip part of the networking stack CAP_NET_RAW allows all kinds of things > but when it is limited to only a single networking stack (one the user had to > create) it becomes safe. I don't remember enough about infiniband to safe if > those parts guarded with CAP_NET_RAW are safe in that way. > Its restrictive use here as well in infiniband too for CAP_NET_RAW. > Eric > > > > > > if (cmd.flow_attr.flags >= IB_FLOW_ATTR_FLAGS_RESERVED)