Re: [CFT][PATCH 2/3] userns: Add a knob to disable setgroups on a per user namespace basis

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Tue, 2 Dec 2014 15:17:32 -0800

On Tue, Dec 2, 2014 at 3:07 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>
>> On Tue, Dec 2, 2014 at 1:45 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>>
>>>> On Tue, Dec 2, 2014 at 12:28 PM, Eric W. Biederman
>>>> <ebiederm@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> - Expose the knob to user space through a proc file /proc/<pid>/setgroups
>>>>
>>>> Can you rename this to something clearer, e.g. userns_setgroups_mode?
>>>
>>> I am not certain that is any clearer.  That just reads as wordier.
>>>
>>> The userns bit is definitely confusing and wrong.  Why should we need to
>>> spell out the scope?
>>
>> Because it's a property of the process' userns, not a property of the process.
>>
>> userns_setgroups would be okay.  (Arguably it should've been
>> userns_uid_map, too, but at least uid_map sounds obviously
>> namespace-related.)
>>
>>>
>>>>>   A value of 0 means the setgroups system call is disabled in the
>>>>>   current processes user namespace and can not be enabled in the
>>>>>   future in this user namespace.
>>>>>
>>>>>   A value of 1 means the segtoups system call is enabled.
>>>>
>>>> Would it make more sense to put strings like "allow" and "deny" in
>>>> here?  That way, future extensions could add additional values.
>>>
>>> If the implementation of the write side isn't too bad.  I would love
>>> to see precedent elsewhere in the kernel.    What I don't expect to do
>>> is have any values except setgroups are enabled and setgroups are
>>> disabled.
>>
>> current_clocksource?  I think there are lots of things like that.
>>
>>>
>>> I am fine with allowing for the possibility (that is just good
>>> engineering) but I don't intend to seriously considering or
>>> implementing other possibilities.
>>
>> I can imagine a mode where there's a fixed set of groups that are
>> forced set but other groups can be added and dropped.  It would be
>> complicated to set up right, but someone might want it some day.
>
> Yeah.  I am defintiely not designing for that.
>
>>>>> diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
>>>>> index 21c91feeca2d..6d0ee1b089fb 100644
>>>>> --- a/arch/s390/kernel/compat_linux.c
>>>>> +++ b/arch/s390/kernel/compat_linux.c
>>>>> @@ -252,6 +252,7 @@ COMPAT_SYSCALL_DEFINE2(s390_setgroups16, int, gidsetsize, u16 __user *, grouplis
>>>>>         int retval;
>>>>>
>>>>>         if (!gid_mapping_possible(user_ns) ||
>>>>> +           !atomic_read(&user_ns->setgroups_allowed) ||
>>>>>             !capable(CAP_SETGID))
>>>>>                 return -EPERM;
>>>>
>>>> This is now incomprehensible because of the gid_mapping_possible
>>>> thing.  If you renamed gid_mapping_possible to
>>>> userns_setgroup_allowed, then this could be added to the
>>>> implementation, and this would all make sense (not to mention avoiding
>>>> duplicating this thing).
>>>>
>>>>> @@ -826,6 +827,11 @@ static bool new_idmap_permitted(const struct file *file,
>>>>>                         kuid_t uid = make_kuid(ns->parent, id);
>>>>>                         if (uid_eq(uid, cred->euid))
>>>>>                                 return true;
>>>>> +               } else if (cap_setid == CAP_SETGID) {
>>>>> +                       kgid_t gid = make_kgid(ns->parent, id);
>>>>> +                       if (!atomic_read(&ns->setgroups_allowed) &&
>>>>> +                           gid_eq(gid, cred->egid))
>>>>> +                               return true;
>>>>
>>>> I still don't see why egid is any better than fsgid here.
>>>
>>> Answered in my earlier response fsgid was a goof.
>>> I can set any gid to my egid with my existing permissions.
>>> Show me how I can do that with fsgid or fsuid and I will be happy to use
>>> those.
>>
>> You can use your fsgid to make a setgid file, I think.  But yes, point
>> taken, although as mentioned in the other thread, I think it would be
>> a lot clearer if it were a separate patch.
>
> Agreed.
>
>>>>> +ssize_t proc_setgroups_write(struct file *file, const char __user *buf,
>>>>> +                            size_t count, loff_t *ppos)
>>>>> +{
>>>>> +       struct seq_file *seq = file->private_data;
>>>>> +       struct user_namespace *ns = seq->private;
>>>>> +       char kbuf[3];
>>>>> +       int setgroups_allowed;
>>>>> +       ssize_t ret;
>>>>> +
>>>>> +       ret = -EPERM;
>>>>> +       if (!file_ns_capable(file, ns, CAP_SETGID))
>>>>> +               goto out;
>>>>
>>>> CAP_SYS_ADMIN?  This isn't setting a gid in the namespace; it's
>>>> reconfiguring the namespace.
>>>
>>> Hmm.  Maybe.  It is an activity that is normally controlled by
>>> CAP_SETGID.
>>>
>>> Frankly I think the entire split up of the capability model is almost
>>> totally broken.  But I think CAP_SETGID is a close approximation of the
>>> right thing here.
>>
>> I agree that the cap model is screwy.  But we use CAP_SYS_ADMIN for
>> everything else that changes the overall behavior of a namespace.
>>
>> In any event, the only way it matters is for a non-ns owner in the
>> parent ns with CAP_SETGID set but not CAP_SYS_ADMIN.  I'd argue that
>> CAP_SETGID should not be usable to make unrelated process' syscalls
>> fail.
>
> That is a pretty decent argument.    I will take a good stare at this
> issue as I refresh these patches and see how close to perfection I can
> make them.
>
>>>>> +       /* Only allow a very narrow range of strings to be written */
>>>>> +       ret = -EINVAL;
>>>>> +       if ((*ppos != 0) || (count >= sizeof(kbuf)) || (count < 1))
>>>>> +               goto out;
>>>>> +
>>>>> +       /* What was written? */
>>>>> +       ret = -EFAULT;
>>>>> +       if (copy_from_user(kbuf, buf, count))
>>>>> +               goto out;
>>>>> +       kbuf[count] = '\0';
>>>>> +
>>>>> +       /* What is being requested? */
>>>>> +       ret = -EINVAL;
>>>>> +       if (kbuf[0] == '0')
>>>>> +               setgroups_allowed = 0;
>>>>> +       else if (kbuf[0] == '1')
>>>>> +               setgroups_allowed = 1;
>>>>> +       else
>>>>> +               goto out;
>>>>> +
>>>>> +       /* Allow a trailing newline */
>>>>> +       ret = -EINVAL;
>>>>> +       if ((count == 2) && (kbuf[1] != '\n'))
>>>>> +               goto out;
>>>>> +
>>>>> +
>>>>> +       if (setgroups_allowed) {
>>>>> +               ret = -EINVAL;
>>>>> +               if (atomic_read(&ns->setgroups_allowed) == 0)
>>>>> +                       goto out;
>>>>> +       } else {
>>>>
>>>> I would disallow this if gid_map has been written in the interest of
>>>> sanity.
>>>
>>> Not a chance.  That is part of making this an independent knob.  If
>>> there is another reason for disabling setgroups you can flip this
>>> knob even after mappings are established.
>>
>> Then you really want CAP_SYS_ADMIN, I think.
>>
>>>
>>>>> +               atomic_set(&ns->setgroups_allowed, 0);
>>>>> +               /* sigh memory barriers! */
>>>>
>>>> I don't think that any barriers are needed.  If you ever observe
>>>> setgroups_allowed == 0, it will stay that way forever.
>>>
>>> Likely.   In practice the code works today.
>>>
>>> But I need to review things closely to understand if there are barriers
>>> needed.  But especially since it is a write once knob we can get away
>>> with a lot.
>>>
>>
>> Yeah.
>>
>> For long-term use, I kind of like the flags approach I added in the
>> other patch.  It makes it easy to add more flags in the future.
>
> I thought a dedicated word might be easier.  I don't know that it makes
> much of a difference at this point.
>
>> In any event, I think the only barrier that's needed is when writing gid_map:
>>
>> atomic_read / test_bit to determine whether unpriv mappings are okay;
>> smp_mb() or whatever the current _after_atomic thing is these days;
>> write mapping;
>>
>> Although I'm not sure whether Linux supports any architectures that
>> can violate causality in the way that barrier is there to prevent.
>
> The DEC Alpha at least used to be able to violate causality in ways
> weird enough to confuse anyone, and the alpha still seems to be in the
> tree.  What keyed me in was the part in atomic_ops.txt that says these
> things don't have barriers and you will probably need them, and I
> remember spin locks tend to take care of all those issues for you.
>
> Anyway thanks for your barrier pointer.
>

My pointer was a bit off.  Barriers synchronize with barriers; the
check for whether setgroups is okay may need a barrier as well:

static inline bool may_setgroups(whatever) {
    if (no groups are mapped)
      return false;
    smp_rmb() (or _before_atomic, maybe -- I don't know whether that's
okay for atomic_read);
    if (setgroups is disallowed by option)
      return false;

    return true;
}

It would be bad if there were a window in which we'd see a group
mapped but not see that setgroups is disallowed.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html