On Tue, Dec 2, 2014 at 3:07 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: > Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes: > >> On Tue, Dec 2, 2014 at 1:45 PM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote: >>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes: >>> >>>> On Tue, Dec 2, 2014 at 12:28 PM, Eric W. Biederman >>>> <ebiederm@xxxxxxxxxxxx> wrote: >>>>> >>>>> - Expose the knob to user space through a proc file /proc/<pid>/setgroups >>>> >>>> Can you rename this to something clearer, e.g. userns_setgroups_mode? >>> >>> I am not certain that is any clearer. That just reads as wordier. >>> >>> The userns bit is definitely confusing and wrong. Why should we need to >>> spell out the scope? >> >> Because it's a property of the process' userns, not a property of the process. >> >> userns_setgroups would be okay. (Arguably it should've been >> userns_uid_map, too, but at least uid_map sounds obviously >> namespace-related.) >> >>> >>>>> A value of 0 means the setgroups system call is disabled in the >>>>> current processes user namespace and can not be enabled in the >>>>> future in this user namespace. >>>>> >>>>> A value of 1 means the segtoups system call is enabled. >>>> >>>> Would it make more sense to put strings like "allow" and "deny" in >>>> here? That way, future extensions could add additional values. >>> >>> If the implementation of the write side isn't too bad. I would love >>> to see precedent elsewhere in the kernel. What I don't expect to do >>> is have any values except setgroups are enabled and setgroups are >>> disabled. >> >> current_clocksource? I think there are lots of things like that. >> >>> >>> I am fine with allowing for the possibility (that is just good >>> engineering) but I don't intend to seriously considering or >>> implementing other possibilities. >> >> I can imagine a mode where there's a fixed set of groups that are >> forced set but other groups can be added and dropped. It would be >> complicated to set up right, but someone might want it some day. > > Yeah. I am defintiely not designing for that. > >>>>> diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c >>>>> index 21c91feeca2d..6d0ee1b089fb 100644 >>>>> --- a/arch/s390/kernel/compat_linux.c >>>>> +++ b/arch/s390/kernel/compat_linux.c >>>>> @@ -252,6 +252,7 @@ COMPAT_SYSCALL_DEFINE2(s390_setgroups16, int, gidsetsize, u16 __user *, grouplis >>>>> int retval; >>>>> >>>>> if (!gid_mapping_possible(user_ns) || >>>>> + !atomic_read(&user_ns->setgroups_allowed) || >>>>> !capable(CAP_SETGID)) >>>>> return -EPERM; >>>> >>>> This is now incomprehensible because of the gid_mapping_possible >>>> thing. If you renamed gid_mapping_possible to >>>> userns_setgroup_allowed, then this could be added to the >>>> implementation, and this would all make sense (not to mention avoiding >>>> duplicating this thing). >>>> >>>>> @@ -826,6 +827,11 @@ static bool new_idmap_permitted(const struct file *file, >>>>> kuid_t uid = make_kuid(ns->parent, id); >>>>> if (uid_eq(uid, cred->euid)) >>>>> return true; >>>>> + } else if (cap_setid == CAP_SETGID) { >>>>> + kgid_t gid = make_kgid(ns->parent, id); >>>>> + if (!atomic_read(&ns->setgroups_allowed) && >>>>> + gid_eq(gid, cred->egid)) >>>>> + return true; >>>> >>>> I still don't see why egid is any better than fsgid here. >>> >>> Answered in my earlier response fsgid was a goof. >>> I can set any gid to my egid with my existing permissions. >>> Show me how I can do that with fsgid or fsuid and I will be happy to use >>> those. >> >> You can use your fsgid to make a setgid file, I think. But yes, point >> taken, although as mentioned in the other thread, I think it would be >> a lot clearer if it were a separate patch. > > Agreed. > >>>>> +ssize_t proc_setgroups_write(struct file *file, const char __user *buf, >>>>> + size_t count, loff_t *ppos) >>>>> +{ >>>>> + struct seq_file *seq = file->private_data; >>>>> + struct user_namespace *ns = seq->private; >>>>> + char kbuf[3]; >>>>> + int setgroups_allowed; >>>>> + ssize_t ret; >>>>> + >>>>> + ret = -EPERM; >>>>> + if (!file_ns_capable(file, ns, CAP_SETGID)) >>>>> + goto out; >>>> >>>> CAP_SYS_ADMIN? This isn't setting a gid in the namespace; it's >>>> reconfiguring the namespace. >>> >>> Hmm. Maybe. It is an activity that is normally controlled by >>> CAP_SETGID. >>> >>> Frankly I think the entire split up of the capability model is almost >>> totally broken. But I think CAP_SETGID is a close approximation of the >>> right thing here. >> >> I agree that the cap model is screwy. But we use CAP_SYS_ADMIN for >> everything else that changes the overall behavior of a namespace. >> >> In any event, the only way it matters is for a non-ns owner in the >> parent ns with CAP_SETGID set but not CAP_SYS_ADMIN. I'd argue that >> CAP_SETGID should not be usable to make unrelated process' syscalls >> fail. > > That is a pretty decent argument. I will take a good stare at this > issue as I refresh these patches and see how close to perfection I can > make them. > >>>>> + /* Only allow a very narrow range of strings to be written */ >>>>> + ret = -EINVAL; >>>>> + if ((*ppos != 0) || (count >= sizeof(kbuf)) || (count < 1)) >>>>> + goto out; >>>>> + >>>>> + /* What was written? */ >>>>> + ret = -EFAULT; >>>>> + if (copy_from_user(kbuf, buf, count)) >>>>> + goto out; >>>>> + kbuf[count] = '\0'; >>>>> + >>>>> + /* What is being requested? */ >>>>> + ret = -EINVAL; >>>>> + if (kbuf[0] == '0') >>>>> + setgroups_allowed = 0; >>>>> + else if (kbuf[0] == '1') >>>>> + setgroups_allowed = 1; >>>>> + else >>>>> + goto out; >>>>> + >>>>> + /* Allow a trailing newline */ >>>>> + ret = -EINVAL; >>>>> + if ((count == 2) && (kbuf[1] != '\n')) >>>>> + goto out; >>>>> + >>>>> + >>>>> + if (setgroups_allowed) { >>>>> + ret = -EINVAL; >>>>> + if (atomic_read(&ns->setgroups_allowed) == 0) >>>>> + goto out; >>>>> + } else { >>>> >>>> I would disallow this if gid_map has been written in the interest of >>>> sanity. >>> >>> Not a chance. That is part of making this an independent knob. If >>> there is another reason for disabling setgroups you can flip this >>> knob even after mappings are established. >> >> Then you really want CAP_SYS_ADMIN, I think. >> >>> >>>>> + atomic_set(&ns->setgroups_allowed, 0); >>>>> + /* sigh memory barriers! */ >>>> >>>> I don't think that any barriers are needed. If you ever observe >>>> setgroups_allowed == 0, it will stay that way forever. >>> >>> Likely. In practice the code works today. >>> >>> But I need to review things closely to understand if there are barriers >>> needed. But especially since it is a write once knob we can get away >>> with a lot. >>> >> >> Yeah. >> >> For long-term use, I kind of like the flags approach I added in the >> other patch. It makes it easy to add more flags in the future. > > I thought a dedicated word might be easier. I don't know that it makes > much of a difference at this point. > >> In any event, I think the only barrier that's needed is when writing gid_map: >> >> atomic_read / test_bit to determine whether unpriv mappings are okay; >> smp_mb() or whatever the current _after_atomic thing is these days; >> write mapping; >> >> Although I'm not sure whether Linux supports any architectures that >> can violate causality in the way that barrier is there to prevent. > > The DEC Alpha at least used to be able to violate causality in ways > weird enough to confuse anyone, and the alpha still seems to be in the > tree. What keyed me in was the part in atomic_ops.txt that says these > things don't have barriers and you will probably need them, and I > remember spin locks tend to take care of all those issues for you. > > Anyway thanks for your barrier pointer. > My pointer was a bit off. Barriers synchronize with barriers; the check for whether setgroups is okay may need a barrier as well: static inline bool may_setgroups(whatever) { if (no groups are mapped) return false; smp_rmb() (or _before_atomic, maybe -- I don't know whether that's okay for atomic_read); if (setgroups is disallowed by option) return false; return true; } It would be bad if there were a window in which we'd see a group mapped but not see that setgroups is disallowed. --Andy _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers