On Thu, Feb 26, 2015 at 04:14:33PM -0600, Christoph Lameter wrote:
>
> V1->V2:
> - Fix up the processing of the caps bits after discussions
>   with Andy and Serge. Make patch less intrusive.
>
> Ambient caps are something like restricted root privileges.
> A process has a set of additional capabilities and those
> are inherited without having to set capabilities in other
> binaries involved. This allows the partial use of root
> like features in a controlled way. It is often useful
> to do this for user space device drivers or software that
> needs increased privileges for networking or to control
> its own scheduling. Ambient caps allow one to avoid
> having to run these with full root privileges.
>
> Control over this feature is available via a new
> prctl option called PR_CAP_AMBIENT. The second argument to prctl
> is the capability number and the third the desired state.
> 0 for off. Otherwise on.
>
> Ambient bits are enabled regardless of the inheritance
> mask of the target binary. They are only restricted
> by the bounding set.
>
> History:
>
> Linux capabilities have suffered from the problem that they are not
> inheritable like regular process characteristics under Unix. This is
> behavior that is counter intuitive to the expected behavior of processes
> in Unix.
>
> In particular, there has recently been software that controls NICs from user
> space and provides IP stack like behavior also in user space (DPDK and RDMA
> kernel API based implementations). Those typically need either capabilities
> to allow raw network access or have to be run setuid. There is scripting and
> LD_PRELOAD etc involved, arbitrary binaries may be run from those scripts
> including those setting additional capabilities or requiring root access.
>
> That does not go well with having file capabilities set that would enable
> the capabilities. Maybe it would work if one would set up capabilities on
> all executables but that would also defeat a secure design since these
> binaries may only need those caps for certain situations. Ok setting the
> inheritable flags on everything may also get one there (if there would not
> be the issues with LD_PRELOAD, debugging etc etc).
>
> The easy solution is to allow some capabilities to be inherited like setuid
> is. We really prefer to use capabilities instead of setuid (we want to
> limit what damage someone can do after all!). Therefore we have been
> running a patch like this in production for the last 6 years. At some
> point it becomes tedious to run your own custom kernel so we would like
> to have this functionality upstream.
>
> See some of the earlier related discussions on the problems with capability
> inheritance:
>
> 0. Recent surprise:
>    https://lkml.org/lkml/2014/1/21/175
>
> 1. Attempt to revise caps
>    http://www.madore.org/~david/linux/newcaps/
>
> 2. Problems of passing caps through exec
>    http://unix.stackexchange.com/questions/128394/passing-capabilities-through-exec
>
> 3. Problems of binding to privileged ports
>    http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l
>
> 4. Reviving capabilities
>    http://lwn.net/Articles/199004/
>
> There does not seem to be an alternative on the horizon. Some involved
> in security development under Linux have even stated that they want to
> rip out the whole thing and replace it. It's been a couple of years now
> and we are still suffering from the capabilities mess. Let us just
> fix it. Others have already done implementations like this like Nokia
> for the N900.
>
>
> This patch does not change the default behavior but it allows setting up
> a list of capabilities via prctl that will enable regular
> unix inheritance only for the selected group of capabilities.
>
> With that it is then possible to do something trivial like setting
> CAP_NET_RAW on an executable that can then allow that capability to
> be inherited by others.
>
> Let's have a look at a coding example of a wrapper that enables
> a couple of capabilities:
>
> ------------------------------ ambient_test.c
> /*
>  * Test program for the ambient capabilities
>  *
>  *
>  * Compile using:
>  *	gcc -o ambient_test ambient_test.c
>  *
>  * This program must have the following capabilities to run properly:
>  * CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
>  *
>  * A command to equip this with the right caps is:
>  *
>  *	setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
>  *
>  * To get a shell with additional caps that can be inherited do:
>  *
>  * ./ambient_test /bin/bash
>  *
>  */
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <errno.h>
> #include <sys/prctl.h>
> #include <linux/capability.h>
>
> /* Definition to be updated in the user space include files */
> #define PR_CAP_AMBIENT 45
>
> int main(int argc, char **argv)
> {
> 	int rc;
>
> 	/* Third argument is the desired state: 0 for off, otherwise on */
> 	if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW, 1))
> 		perror("Cannot set CAP_NET_RAW");
>
> 	if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN, 1))
> 		perror("Cannot set CAP_NET_ADMIN");
>
> 	if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE, 1))
> 		perror("Cannot set CAP_SYS_NICE");
>

Your example program is not filling in pI though?  Ah, I see why.  In
get_file_caps() you are still assigning fP = pA if the file has no file
capabilities, so then you are actually doing

	pP' = (X & (fP | pA)) | (pI & (fI | pA))

rather than

	pP' = (X & fP) | (pI & (fI | pA))

Other than that, the patch is looking good to me.  We should consider
emitting an audit record when a task fills in its pA, and I do still
wonder whether we should be requiring CAP_SETFCAP (unsure how best to
think of it).  But assuming the fP = pA was not intended, I think this
largely does the right thing.
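To make the difference between the two expressions concrete, here is a
small user space sketch.  It is plain bit arithmetic, not kernel code:
the helper names pp_as_coded() and pp_as_documented() and the CAP_BIT()
macro are made up for illustration, and the capability numbers are
copied from <linux/capability.h> so that it builds without kernel
headers.

```c
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h> */
#define CAP_NET_ADMIN	12
#define CAP_NET_RAW	13
#define CAP_SYS_NICE	23

/* Hypothetical helper: mask with only capability bit n set */
#define CAP_BIT(n)	(UINT64_C(1) << (n))

/* pP' = (X & (fP | pA)) | (pI & (fI | pA)), the effect of fP = pA */
uint64_t pp_as_coded(uint64_t X, uint64_t fP, uint64_t fI,
		     uint64_t pI, uint64_t pA)
{
	return (X & (fP | pA)) | (pI & (fI | pA));
}

/* pP' = (X & fP) | (pI & (fI | pA)), the formula in the patch comment */
uint64_t pp_as_documented(uint64_t X, uint64_t fP, uint64_t fI,
			  uint64_t pI, uint64_t pA)
{
	return (X & fP) | (pI & (fI | pA));
}
```

Take a full bounding set, a file without file capabilities
(fP = fI = 0), an empty pI, and pA = CAP_NET_RAW | CAP_NET_ADMIN |
CAP_SYS_NICE.  pp_as_coded() then returns 0x803000, i.e. all of pA,
while pp_as_documented() returns 0.  With the second form the caller
would also have to raise the same bits in pI before the exec.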
> 	printf("Ambient_test forking shell\n");
> 	if (execv(argv[1], argv + 1))
> 		perror("Cannot exec");
>
> 	return 0;
> }
> -------------------------------- ambient_test.c
>
> Allows the inheritance of CAP_SYS_NICE, CAP_NET_RAW and CAP_NET_ADMIN.
> With that device raw access is possible and also real time priorities
> can be set from user space. This is a frequently needed set of
> privileged operations in HPC and HFT applications. User space
> processes need to be able to directly access devices as well as
> have full control over scheduling.
>
> Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>
>
> Index: linux/security/commoncap.c
> ===================================================================
> --- linux.orig/security/commoncap.c	2015-02-25 13:43:06.929973954 -0600
> +++ linux/security/commoncap.c	2015-02-26 16:10:02.347913397 -0600
> @@ -347,15 +347,17 @@ static inline int bprm_caps_from_vfs_cap
>  		*has_cap = true;
>
>  	CAP_FOR_EACH_U32(i) {
> +		__u32 ambient = current_cred()->cap_ambient.cap[i];
>  		__u32 permitted = caps->permitted.cap[i];
>  		__u32 inheritable = caps->inheritable.cap[i];
>
>  		/*
> -		 * pP' = (X & fP) | (pI & fI)
> +		 * pP' = (X & fP) | (pI & (fI | pA))
>  		 */
>  		new->cap_permitted.cap[i] =
>  			(new->cap_bset.cap[i] & permitted) |
> -			(new->cap_inheritable.cap[i] & inheritable);
> +			(new->cap_inheritable.cap[i] &
> +				(inheritable | ambient));
>
>  		if (permitted & ~new->cap_permitted.cap[i])
>  			/* insufficient to execute correctly */
> @@ -453,8 +455,18 @@ static int get_file_caps(struct linux_bi
>  		if (rc == -EINVAL)
>  			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
>  				__func__, rc, bprm->filename);
> -		else if (rc == -ENODATA)
> +		else if (rc == -ENODATA) {
>  			rc = 0;
> +			if (!cap_isclear(current_cred()->cap_ambient)) {
> +				/*
> +				 * The ambient caps are permitted for
> +				 * files that have no caps
> +				 */
> +				bprm->cred->cap_permitted =
> +					current_cred()->cap_ambient;
> +				*effective = true;
> +			}
> +		}
>  		goto out;
>  	}
>
> @@ -549,9 +561,20 @@ skip:
>  		new->sgid = new->fsgid = new->egid;
>
>  	if (effective)
> +		/*
> +		 * pE' = pP' & (fE | pA)
> +		 *
> +		 * fE is implicitly all set if effective == true.
> +		 * Therefore the above reduces to
> +		 *
> +		 * pE' = pP'
> +		 */
>  		new->cap_effective = new->cap_permitted;
>  	else
>  		cap_clear(new->cap_effective);
> +
> +	/* pA' = pA */
> +	new->cap_ambient = old->cap_ambient;
>  	bprm->cap_effective = effective;
>
>  	/*
> @@ -566,7 +589,7 @@ skip:
>  	 * Number 1 above might fail if you don't have a full bset, but I think
>  	 * that is interesting information to audit.
>  	 */
> -	if (!cap_isclear(new->cap_effective)) {
> +	if (!cap_issubset(new->cap_effective, new->cap_ambient)) {
>  		if (!cap_issubset(CAP_FULL_SET, new->cap_effective) ||
>  		    !uid_eq(new->euid, root_uid) || !uid_eq(new->uid, root_uid) ||
>  		    issecure(SECURE_NOROOT)) {
> @@ -598,7 +621,7 @@ int cap_bprm_secureexec(struct linux_bin
>  	if (!uid_eq(cred->uid, root_uid)) {
>  		if (bprm->cap_effective)
>  			return 1;
> -		if (!cap_isclear(cred->cap_permitted))
> +		if (!cap_issubset(cred->cap_permitted, cred->cap_ambient))
>  			return 1;
>  	}
>
> @@ -933,6 +956,23 @@ int cap_task_prctl(int option, unsigned
>  			new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
>  		return commit_creds(new);
>
> +	case PR_CAP_AMBIENT:
> +		if (!ns_capable(current_user_ns(), CAP_SETPCAP))
> +			return -EPERM;
> +
> +		if (!cap_valid(arg2))
> +			return -EINVAL;
> +
> +		if (!ns_capable(current_user_ns(), arg2))
> +			return -EPERM;
> +
> +		new = prepare_creds();
> +		if (arg3 == 0)
> +			cap_lower(new->cap_ambient, arg2);
> +		else
> +			cap_raise(new->cap_ambient, arg2);
> +		return commit_creds(new);
> +
>  	default:
>  		/* No functionality available - continue with default */
>  		return -ENOSYS;
> Index: linux/include/linux/cred.h
> ===================================================================
> --- linux.orig/include/linux/cred.h	2015-02-25 13:43:06.929973954 -0600
> +++ linux/include/linux/cred.h	2015-02-25 13:43:06.925972078 -0600
> @@ -122,6 +122,7 @@ struct cred {
>  	kernel_cap_t cap_permitted; /* caps we're permitted */
>  	kernel_cap_t cap_effective; /* caps we can actually use */
>  	kernel_cap_t cap_bset; /* capability bounding set */
> +	kernel_cap_t cap_ambient; /* Ambient capability set */
>  #ifdef CONFIG_KEYS
>  	unsigned char jit_keyring; /* default keyring to attach requested
>  				    * keys to */
> Index: linux/include/uapi/linux/prctl.h
> ===================================================================
> --- linux.orig/include/uapi/linux/prctl.h	2015-02-25 13:43:06.929973954 -0600
> +++ linux/include/uapi/linux/prctl.h	2015-02-25 13:43:06.925972078 -0600
> @@ -185,4 +185,7 @@ struct prctl_mm_map {
>  #define PR_MPX_ENABLE_MANAGEMENT 43
>  #define PR_MPX_DISABLE_MANAGEMENT 44
>
> +/* Control the ambient capability set */
> +#define PR_CAP_AMBIENT 45
> +
>  #endif /* _LINUX_PRCTL_H */
> Index: linux/fs/proc/array.c
> ===================================================================
> --- linux.orig/fs/proc/array.c	2015-02-25 13:43:06.929973954 -0600
> +++ linux/fs/proc/array.c	2015-02-25 13:43:06.925972078 -0600
> @@ -302,7 +302,8 @@ static void render_cap_t(struct seq_file
>  static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  {
>  	const struct cred *cred;
> -	kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
> +	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
> +			cap_bset, cap_ambient;
>
>  	rcu_read_lock();
>  	cred = __task_cred(p);
> @@ -310,12 +311,14 @@ static inline void task_cap(struct seq_f
>  	cap_permitted = cred->cap_permitted;
>  	cap_effective = cred->cap_effective;
>  	cap_bset = cred->cap_bset;
> +	cap_ambient = cred->cap_ambient;
>  	rcu_read_unlock();
>
>  	render_cap_t(m, "CapInh:\t", &cap_inheritable);
>  	render_cap_t(m, "CapPrm:\t", &cap_permitted);
>  	render_cap_t(m, "CapEff:\t", &cap_effective);
>  	render_cap_t(m, "CapBnd:\t", &cap_bset);
> +	render_cap_t(m, "CapAmb:\t", &cap_ambient);
>  }
>
>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
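
For testing, the new CapAmb line could be checked from the exec'ed
child with something along these lines.  This is only a sketch:
parse_cap_line() and dump_caps() are hypothetical helper names, and it
assumes the "<field>:\t<hex mask>" layout that render_cap_t() produces
for the existing Cap* lines.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one status line of the form "<field>:\t<hex mask>".
 * Returns 1 and stores the mask on a match, 0 otherwise.
 */
int parse_cap_line(const char *line, const char *field, uint64_t *mask)
{
	size_t n = strlen(field);
	const char *start;
	char *end;

	if (strncmp(line, field, n) != 0 || line[n] != ':')
		return 0;

	start = line + n + 1;
	*mask = strtoull(start, &end, 16);	/* skips the leading tab */
	return end != start;
}

/* Print every Cap* mask found in a status stream, return how many */
int dump_caps(FILE *in, FILE *out)
{
	const char *fields[] = {
		"CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb"
	};
	char line[256];
	int found = 0;

	while (fgets(line, sizeof(line), in)) {
		for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
			uint64_t mask;

			if (parse_cap_line(line, fields[i], &mask)) {
				fprintf(out, "%s = 0x%llx\n", fields[i],
					(unsigned long long)mask);
				found++;
			}
		}
	}
	return found;
}
```

Calling dump_caps() on a stream opened from /proc/self/status prints
the masks for the current task.  On a kernel without this patch the
CapAmb line is simply absent and only the other four are reported.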