> From: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > Sent: Monday, March 10, 2025 11:44 PM > > Parav Pandit <parav@xxxxxxxxxx> writes: > > >> From: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > >> Sent: Monday, March 10, 2025 9:59 PM > >> > >> Parav Pandit <parav@xxxxxxxxxx> writes: > >> > >> > A process running in a non-init user namespace possesses the > >> > CAP_NET_RAW capability. However, the patch cited in the fixes tag > >> > checks the capability in the default init user namespace. > >> > Because of this, when the process was started by Podman in a > >> > non-default user namespace, the flow creation failed. > >> > >> This change isn't a bug fix. This change is a relaxation of > >> permissions and it would be very good if this change description described > why it is in fact safe. > > As you explained below, it is not safe enough. :) I will improve the > > change description to reflect as I follow your good suggestions below. > > > >> > >> Many parts of the kernel are not safe for arbitrary users > >> to use. In those cases an ordinary capable like you found > >> is used. > >> > > Understood now. > > > >> > Fix this issue by checking the CAP_NET_RAW networking capability in > >> > the owner user namespace that created the network namespace. > >> > > >> > This change is similar to the following cited patches. > >> > > >> > commit 5e1fccc0bfac ("net: Allow userns root control of the core of > >> > the network stack.") commit 52e804c6dfaa ("net: Allow userns root > >> > to control ipv4") commit 59cd7377660a ("net: openvswitch: allow > >> > conntrack in non-initial user namespace") commit 0a3deb11858a ("fs: > >> > Allow > >> > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs: > >> > relax permissions for listmount()") > >> > >> It is different in that hardware is involved. There is a fair amount > >> of kernel bypass allowed by design in infiniband so this may indeed > >> be safe to allow any user on the system to do. Still for someone who > >> isn't intimate with infiniband this isn't clear. > >> > >> > Fixes: c938a616aadb ("IB/core: Add raw packet QP type") > >> > Signed-off-by: Parav Pandit <parav@xxxxxxxxxx> > >> > > >> > --- > >> > I would like to have feedback from the LSM experts to make sure > >> > this fix is correct. Given the widespread usage of the capable() > >> > call, it makes me wonder if the patch right. > >> > > >> > Secondly, I wasn't able to determine which primary namespace (such > >> > as mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability. > >> > (not directly related to this patch, but as concept) > >> > >> I took a quick look and it appears that no one figures any of the > >> CAP_IPC_LOCK capability checks are safe for anyone except the global > >> root user. > >> > >> Allowing an arbitrary user to lock all of memory seems to defeat all > >> of the safeguards that are in place to limiting memory locking. > >> > >> It looks like RLIMIT_MEMLOCK has been updated to be per user > >> namespace (with hierachical limits), so I expect the most reasonable > >> thing to do is to simply ensure the process that creates the user > >> namespace has a large enough RLIMIT_MEMLOCK when the user > namespace is created. > > Ok, but if infiniband code does capable(), it is going to check the limit > outside of the user namespace, and the call will still fails. > > Isn't it? > > It depends on how the check is implemented. My point is that > RLIMIT_MEMLOCK has all of the knobs you might need to do something. > > I don't know if the checks you are concerned about allow using > RLIMIT_MEMLOCK. Given that some of them require having root in the initial > user namespace they might make a lot of assumptions. > > But rlimits are related to but separate from capabilities. > Ok. I will take this task separately. > > May be the users in non init user ns must run their infiniband application > without pinning the memory. > > Aka ODP in infiniband world. > > That sounds right. I don't remember enough about infiniband to say for > certain. > > Basically anything that uses ns_capable should be treated as something any > user can do, and so you need to watch out for hostile users. > Got it. > >> > --- > >> > drivers/infiniband/core/uverbs_cmd.c | 2 +- > >> > 1 file changed, 1 insertion(+), 1 deletion(-) > >> > > >> > diff --git a/drivers/infiniband/core/uverbs_cmd.c > >> > b/drivers/infiniband/core/uverbs_cmd.c > >> > index 5ad14c39d48c..8d6615f390f5 100644 > >> > --- a/drivers/infiniband/core/uverbs_cmd.c > >> > +++ b/drivers/infiniband/core/uverbs_cmd.c > >> > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct > >> uverbs_attr_bundle *attrs) > >> > if (cmd.comp_mask) > >> > return -EINVAL; > >> > > >> > - if (!capable(CAP_NET_RAW)) > >> > + if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW)) > >> > return -EPERM; > >> > >> Looking at the code in drivers/infiniband/core/uverbs_cmd.c > >> I don't think original capable call is actually correct. > >> > >> The problem is that infiniband runs commands through a file descriptor. > >> Which means that anyone who can open the file descriptor and then > >> obtain a program that will work like a suid cat can bypass the > >> current permission check. > >> > >> Before we relax any checks that test needs to be: > >> file_ns_capable(file, &init_user_ns, CAP_NET_RAW); > >> > > > >> Similarly the network namespace you are talking about in those > >> infiniband commands really needs to be derived from the file > >> descriptor instead of current. > >> > > This now start making sense to me. > > When the file descriptor is open, I need to record the net ns and use it for > rest of the life cycle of the process (even if unshare(CLONE_NEWNET) is > called) after opening the file. > > For the rest of the life cycle of the file descriptor. Don't forget that file > descriptors can be passed between processes. > Right. We have seen increasing use of this in rdma applications in recent years. > > Something like how sk_alloc() does sock_net_set(sk, net); > > > > Do I understand you correctly? > > Yes. > > But first. The permission checks need to be fixed to use the cred cached on > the file descriptor. So that the permission checks are not against the current > process, but are against the process that opened the file descriptor. > I am fixing this in the v1. Was trying to find the subsequent patch after the fix on what things to take care of. > Otherwise a non-privileged process can open the file descriptor and trick > another process with more permissions to write the values it wants to have > written to the file descriptor. Usually that is accomplished by tricking some > privileged application to write to stderr (that is passed from the attacker). > > Most of the time you can fix things like that using file_ns_capable. > Other times you encounter a userspace program that breaks and something > else needs to happen. > > >> Those kinds of bugs seem very easy to find in the infiniband code so > >> I have a hunch that the infiniband code needs some tender loving care > >> before it is safe for unprivileged users to be able to do anything with it. > >> > > Well, started to improve now... > > No fault if you haven't lots of code that only root could use no one takes > seriously with respect to security issues, so it tends to be buggy. I have > cleaned up a lot of code over the years before I have relaxed the permissions. > Great feedback, thank you Eric for the direction. I will roll v1 for the fix you suggested and utilize that same infrastructure to honor the non-default user ns. > Eric