RE: [PATCH] RDMA/uverbs: Fix CAP_NET_RAW check for flow create in user namespace

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> From: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> Sent: Monday, March 10, 2025 11:44 PM
> 
> Parav Pandit <parav@xxxxxxxxxx> writes:
> 
> >> From: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
> >> Sent: Monday, March 10, 2025 9:59 PM
> >>
> >> Parav Pandit <parav@xxxxxxxxxx> writes:
> >>
> >> > A process running in a non-init user namespace possesses the
> >> > CAP_NET_RAW capability. However, the patch cited in the fixes tag
> >> > checks the capability in the default init user namespace.
> >> > Because of this, when the process was started by Podman in a
> >> > non-default user namespace, the flow creation failed.
> >>
> >> This change isn't a bug fix.  This change is a relaxation of
> >> permissions and it would be very good if this change description described
> why it is in fact safe.
> > As you explained below, it is not safe enough. :) I will improve the
> > change description to reflect as I follow your good suggestions below.
> >
> >>
> >> Many parts of the kernel are not safe for arbitrary users
> >> to use.   In those cases an ordinary capable like you found
> >> is used.
> >>
> > Understood now.
> >
> >> > Fix this issue by checking the CAP_NET_RAW networking capability in
> >> > the owner user namespace that created the network namespace.
> >> >
> >> > This change is similar to the following cited patches.
> >> >
> >> > commit 5e1fccc0bfac ("net: Allow userns root control of the core of
> >> > the network stack.") commit 52e804c6dfaa ("net: Allow userns root
> >> > to control ipv4") commit 59cd7377660a ("net: openvswitch: allow
> >> > conntrack in non-initial user namespace") commit 0a3deb11858a ("fs:
> >> > Allow
> >> > listmount() in foreign mount namespace") commit dd7cb142f467 ("fs:
> >> > relax permissions for listmount()")
> >>
> >> It is different in that hardware is involved.  There is a fair amount
> >> of kernel bypass allowed by design in infiniband so this may indeed
> >> be safe to allow any user on the system to do.  Still for someone who
> >> isn't intimate with infiniband this isn't clear.
> >>
> >> > Fixes: c938a616aadb ("IB/core: Add raw packet QP type")
> >> > Signed-off-by: Parav Pandit <parav@xxxxxxxxxx>
> >> >
> >> > ---
> >> > I would like to have feedback from the LSM experts to make sure
> >> > this fix is correct. Given the widespread usage of the capable()
> >> > call, it makes me wonder if the patch right.
> >> >
> >> > Secondly, I wasn't able to determine which primary namespace (such
> >> > as mount or IPC, etc.) to consider for the CAP_IPC_LOCK capability.
> >> > (not directly related to this patch, but as concept)
> >>
> >> I took a quick look and it appears that no one figures any of the
> >> CAP_IPC_LOCK capability checks are safe for anyone except the global
> >> root user.
> >>
> >> Allowing an arbitrary user to lock all of memory seems to defeat all
> >> of the safeguards that are in place to limiting memory locking.
> >>
> >> It looks like RLIMIT_MEMLOCK has been updated to be per user
> >> namespace (with hierachical limits), so I expect the most reasonable
> >> thing to do is to simply ensure the process that creates the user
> >> namespace has a large enough RLIMIT_MEMLOCK when the user
> namespace is created.
> > Ok, but if infiniband code does capable(), it is going to check the limit
> outside of the user namespace, and the call will still fails.
> > Isn't it?
> 
> It depends on how the check is implemented.  My point is that
> RLIMIT_MEMLOCK has all of the knobs you might need to do something.
> 
> I don't know if the checks you are concerned about allow using
> RLIMIT_MEMLOCK.  Given that some of them require having root in the initial
> user namespace they might make a lot of assumptions.
> 
> But rlimits are related to but separate from capabilities.
>
Ok. I will take this task separately.
 
> > May be the users in non init user ns must run their infiniband application
> without pinning the memory.
> > Aka ODP in infiniband world.
> 
> That sounds right.  I don't remember enough about infiniband to say for
> certain.
> 
> Basically anything that uses ns_capable should be treated as something any
> user can do, and so you need to watch out for hostile users.
> 
Got it.

> >> > ---
> >> >  drivers/infiniband/core/uverbs_cmd.c | 2 +-
> >> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >> >
> >> > diff --git a/drivers/infiniband/core/uverbs_cmd.c
> >> > b/drivers/infiniband/core/uverbs_cmd.c
> >> > index 5ad14c39d48c..8d6615f390f5 100644
> >> > --- a/drivers/infiniband/core/uverbs_cmd.c
> >> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> >> > @@ -3198,7 +3198,7 @@ static int ib_uverbs_ex_create_flow(struct
> >> uverbs_attr_bundle *attrs)
> >> >  	if (cmd.comp_mask)
> >> >  		return -EINVAL;
> >> >
> >> > -	if (!capable(CAP_NET_RAW))
> >> > +	if (!ns_capable(current->nsproxy->net_ns->user_ns, CAP_NET_RAW))
> >> >  		return -EPERM;
> >>
> >> Looking at the code in drivers/infiniband/core/uverbs_cmd.c
> >> I don't think original capable call is actually correct.
> >>
> >> The problem is that infiniband runs commands through a file descriptor.
> >> Which means that anyone who can open the file descriptor and then
> >> obtain a program that will work like a suid cat can bypass the
> >> current permission check.
> >>
> >> Before we relax any checks that test needs to be:
> >> file_ns_capable(file, &init_user_ns, CAP_NET_RAW);
> >>
> >
> >> Similarly the network namespace you are talking about in those
> >> infiniband commands really needs to be derived from the file
> >> descriptor instead of current.
> >>
> > This now start making sense to me.
> > When the file descriptor is open, I need to record the net ns and use it for
> rest of the life cycle of the process (even if unshare(CLONE_NEWNET) is
> called) after opening the file.
> 
> For the rest of the life cycle of the file descriptor.  Don't forget that file
> descriptors can be passed between processes.
>
Right. We have seen increasing use of this in rdma applications in recent years.
 
> > Something like how sk_alloc() does sock_net_set(sk, net);
> >
> > Do I understand you correctly?
> 
> Yes.
> 
> But first.  The permission checks need to be fixed to use the cred cached on
> the file descriptor.  So that the permission checks are not against the current
> process, but are against the process that opened the file descriptor.
> 
I am fixing this in the v1.
Was trying to find the subsequent patch after the fix on what things to take care of.

> Otherwise a non-privileged process can open the file descriptor and trick
> another process with more permissions to write the values it wants to have
> written to the file descriptor.  Usually that is accomplished by tricking some
> privileged application to write to stderr (that is passed from the attacker).
> 
> Most of the time you can fix things like that using file_ns_capable.
> Other times you encounter a userspace program that breaks and something
> else needs to happen.
> 
> >> Those kinds of bugs seem very easy to find in the infiniband code so
> >> I have a hunch that the infiniband code needs some tender loving care
> >> before it is safe for unprivileged users to be able to do anything with it.
> >>
> > Well, started to improve now...
> 
> No fault if you haven't lots of code that only root could use no one takes
> seriously with respect to security issues, so it tends to be buggy.  I have
> cleaned up a lot of code over the years before I have relaxed the permissions.
>
Great feedback, thank you Eric for the direction.
I will roll v1 for the fix you suggested and utilize that same infrastructure to honor the non-default user ns.
 
> Eric





[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux