Re: [REVIEW][PATCH 0/43] Completing the user namespace

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Tue, 10 Apr 2012 16:50:01 -0700

Andrew Lutomirski <luto@xxxxxxx> writes:

> On Tue, Apr 10, 2012 at 2:59 PM, Eric W. Biederman
> <ebiederm@xxxxxxxxxxxx> wrote:
>> Andy Lutomirski <luto@xxxxxxx> writes:
>>>
>>> [...]
>>>
>>> I haven't read enough of the details to figure out how the uid mapping
>>> works (do all the child namespace uids map to the same parent uid?), so
>>> I may be missing some details here.
>>
>> You seem to be missing a detail or two.
>>
>> What you want to look at are the functions make_kuid and from_kuid
>> in kernel/user_namespace.c  You might look at the patches that talk
>> about uidgid.h and introducing a mapping layer.
>>
>> The implementation creates an incomplete but 1-1 mapping to the uids in
>> the initial user namespace.  Which means except for the change in
>> datatype (sigh) the existing permission checks don't need to be changed.
>
> I'll do my homework at the same time that I write up docs for
> no_new_privs (i.e. maybe today).
>
>>
>> My understanding of no_new_privs is that current_cred() including
>> the user, the user namespace and the security label will never change,
>> with the goal of making the security analysis simple.
>
> They can change but only if you already have the privilege to change
> them yourself and then you do so.  For example, PR_SET_NO_NEW_PRIVS,
> setuid, then drop caps is allowed and useful -- it's a race-free way
> to make sure that a given uid never executes without no_new_privs set.
>  I've implemented this as a pam module.

Careful.  There is the security_task_fix_setuid call that will raise
your capabilities from cap->effective to cap->permitted if you call
setuid(0).  Which in the general case means you can regain all of the
root privileges if you only have CAP_SETUID.
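
To make that concrete, here is a rough userspace model of the euid
transition rule as I understand it.  This is a sketch, not the kernel
code: struct model_cred, model_fix_setuid and model_setuid_root are
made-up names, capabilities are squashed into a 64-bit mask, and the
real hook also considers the real and saved uids and the securebits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the capability fields of a cred. */
struct model_cred {
	unsigned int euid;
	uint64_t cap_permitted;
	uint64_t cap_effective;
};

/* CAP_SETUID is capability number 7. */
#define MODEL_CAP_SETUID (1ULL << 7)

/*
 * Models only the euid transitions.  Assumption: the real and saved
 * uids and the securebits are ignored in this sketch.
 */
static void model_fix_setuid(struct model_cred *new,
			     const struct model_cred *old)
{
	/* Root euid -> non-root euid: the effective set is cleared. */
	if (old->euid == 0 && new->euid != 0)
		new->cap_effective = 0;
	/* Non-root euid -> root euid: effective is raised to permitted. */
	if (old->euid != 0 && new->euid == 0)
		new->cap_effective = new->cap_permitted;
}

/* Model of setuid(0): allowed for root or for a holder of CAP_SETUID. */
static int model_setuid_root(struct model_cred *cred)
{
	struct model_cred old;

	assert(cred != NULL);
	old = *cred;
	if (cred->euid != 0 && !(cred->cap_effective & MODEL_CAP_SETUID))
		return -1;	/* the kernel would return -EPERM here */
	cred->euid = 0;
	model_fix_setuid(cred, &old);
	return 0;
}
```

In this model a task with euid 1000, a full permitted set, and only
CAP_SETUID in its effective set ends up with effective == permitted
after model_setuid_root(), which is the escalation I am warning about.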

> This still simplifies security analysis: the guarantee is that, if
> no_new_privs is set, then a task's children cannot do anything that
> the task could not do on its own.  Therefore it's safe for the task to
> manipulate its own environment in whatever strange ways it wants,
> because even if that gives it the ability to subvert its children,
> there is no privilege gained.

>> I don't recall how seccomp filters are dealt with if you don't have
>> no_new_privs enabled.  If seccomp filters installed by root
>> are dropped when we change privilege levels it might be worth looking
>> at how to keep a seccomp filter installed as long as you stay in
>> a user namespace.
>>
>
> They're not dropped.  I think in the current implementation they can't
> be dropped at all.

Which makes sense.  Is this why you need no_new_privs?  So you can't run
seccomp on higher privileged executables and confuse them into keeping
privileges when they should not?

>> There are essentially two modes you can use the user namespace in:
>> with mappings setup (a privileged operation) and with no mappings.
>
>>
>> With no mappings you cannot create a new user namespace or change your
>> uid or gids, and suid exec fails (or possibly ignores the uid/gid change
>> but I am starting with suid exec fails).  Making user namespaces similar
>> to no_new_privs.
>>
>> The emphasis is a bit different from no_new_privs as the user_namespace
>> does not need to guarantee that the lsm will not change security labels,
>> etc.
>
> Hmm.  Is this safe?  For example, if there's a program that LSM policy
> grants extra privileges that malfunctions when run inside a user
> namespace, can that be used to break out of LSM restrictions?

I can't see how it would not be safe.

Except for the user namespace pointer, the state the LSM sees is the
same state the rest of the kernel sees.  Aka userspace sees uid 0, the
LSM does not.  So I don't know why an LSM would get confused.
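
As an illustration of "userspace sees uid 0, the LSM does not", here is
a small userspace model of an extent-based mapping in the spirit of
make_kuid and from_kuid.  It is only a sketch under my own assumptions:
the model_* names are invented, ids are plain 32-bit integers rather
than a distinct kernel type, and the map goes straight to the initial
user namespace instead of walking parent namespaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returned when an id has no mapping; the mapping is incomplete. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/* One contiguous range: ids [first, first + count) inside the namespace
 * correspond to [lower_first, lower_first + count) outside it. */
struct model_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct model_userns {
	size_t nr_extents;
	const struct model_extent *extent;
};

/* Namespace-local uid -> id as seen in the initial namespace. */
static uint32_t model_make_kuid(const struct model_userns *ns, uint32_t uid)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct model_extent *e = &ns->extent[i];

		if (uid >= e->first && uid - e->first < e->count)
			return e->lower_first + (uid - e->first);
	}
	return MODEL_INVALID_ID;
}

/* Id as seen in the initial namespace -> namespace-local uid. */
static uint32_t model_from_kuid(const struct model_userns *ns, uint32_t kuid)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct model_extent *e = &ns->extent[i];

		if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
			return e->first + (kuid - e->lower_first);
	}
	return MODEL_INVALID_ID;
}
```

With a single extent { first = 0, lower_first = 100000, count = 65536 }
(numbers picked only for the example), uid 0 inside the namespace
becomes 100000 in the model, and that is the value a permission check
or an LSM would compare against.  Anything outside the extent has no
mapping at all, which is the "incomplete but 1-1" property.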

Beyond that it is a bug for an LSM to grant permissions beyond the
core DAC model.  So the worst I can see is an LSM not grokking user
namespaces and getting confused and not restricting a process as
much as the designer of the LSM would like.

>> At a basic level of interaction I expect no_new_privs will need to fail
>> any change of the user namespace.  As changing the user namespace
>> changes current_cred(), and fundamentally allows more things to happen.
>
> If a user namespace has no visible effect on processes that aren't
> descendants of whoever created it, then creating one in no_new_privs
> mode should be safe.  On the other hand, it could be somewhat useless.

Creating a user namespace will allow a process access to more kernel
facilities.  Aka you can (or at least will be able to) create network
namespaces and mount namespaces and the like.  That increases the
surface of the kernel an attacker can hit.

So in a perfect kernel there are no effects on others.  In a scenario
where you are limiting how much of the kernel a user can use, I think
you would want creation of user namespaces to fail.

Still, given that you aren't enforcing the very restrictive
"current_cred() must not change" rule, I don't know how much it matters,
and a bpf based seccomp filter can pretty easily filter out new user
namespace creation.  Shrug.

Eric
