On Thu, Jun 12, 2014 at 08:48:25AM +0200, Cédric Bosdonnat wrote: > Added <capabilities> in the <features> section of LXC domains > configuration. This section can contain elements named after the > capabilities like: > > <mknod state="on"/>, keep CAP_MKNOD capability > <sys_chroot state="off"/> drop CAP_SYS_CHROOT capability > > Users can restrict or give more capabilities than the default using > this mechanism. > --- > docs/schemas/domaincommon.rng | 196 ++++++++++++++++++++++++ > src/conf/domain_conf.c | 93 ++++++++++- > src/conf/domain_conf.h | 47 ++++++ > src/libvirt_private.syms | 1 + > src/lxc/lxc_cgroup.c | 5 + > src/lxc/lxc_container.c | 90 +++++++++-- > tests/domainschemadata/domain-caps-features.xml | 28 ++++ > 7 files changed, 442 insertions(+), 18 deletions(-) > create mode 100644 tests/domainschemadata/domain-caps-features.xml > diff --git a/src/lxc/lxc_cgroup.c b/src/lxc/lxc_cgroup.c > index 8dfdc60..71a0d61 100644 > --- a/src/lxc/lxc_cgroup.c > +++ b/src/lxc/lxc_cgroup.c > @@ -357,6 +357,11 @@ static int virLXCCgroupSetupDeviceACL(virDomainDefPtr def, > {'c', LXC_DEV_MAJ_FUSE, LXC_DEV_MIN_FUSE}, > {0, 0, 0}}; > > + /* No white list if CAP_MKNOD has to be kept */ > + int capMknod = def->caps_features[VIR_DOMAIN_CAPS_FEATURE_MKNOD]; > + if (capMknod == VIR_DOMAIN_FEATURE_STATE_ON) > + return 0; > + > if (virCgroupDenyAllDevices(cgroup) < 0) > goto cleanup; So wrt device nodes we have two layers of defence - Blocking CAP_MKNOD - this means user can only have access to device nodes that are present in the /dev that we populate. - CGroups ACL blocking mkdir, read and write for everything except the device nodes that are implied by (or explicitly requested in) the XML I can see the value in allowing CAP_MKNOD because if we have granted the user '/dev/foo' access, there's no harm in letting them "mknod /dev/foo" too. If we have granted CAP_MKNOD though I'm not convinced that this implies we should allow access to any possible device ever. So at most I think we should allow 'mknod' in the cgroups, but still keep read+write blocked. We definitely shouldn't allow read+write just because they requested CAP_MKNOD. We already let users request access to arbitrary devices in the XML config to deal with the latter. > - if ((ret = capng_updatev(CAPNG_DROP, > - CAPNG_EFFECTIVE | CAPNG_PERMITTED | > - CAPNG_INHERITABLE | CAPNG_BOUNDING_SET, > - CAP_SYS_MODULE, /* No kernel module loading */ > - CAP_SYS_TIME, /* No changing the clock */ > - CAP_MKNOD, /* No creating device nodes */ > - CAP_AUDIT_CONTROL, /* No messing with auditing status */ > - CAP_MAC_ADMIN, /* No messing with LSM config */ > - keepReboot ? -1 : CAP_SYS_BOOT, /* No use of reboot */ > - -1)) < 0) { > - virReportError(VIR_ERR_INTERNAL_ERROR, > - _("Failed to remove capabilities: %d"), ret); > - return -1; > + for (i = 0; i < VIR_DOMAIN_CAPS_FEATURE_LAST; i++) { > + bool toDrop = false; > + int state = def->caps_features[i]; > + > + switch ((virDomainCapsFeature) i) { > + case VIR_DOMAIN_CAPS_FEATURE_SYS_BOOT: /* No use of reboot */ > + toDrop = !keepReboot && (state != VIR_DOMAIN_FEATURE_STATE_ON); > + break; > + case VIR_DOMAIN_CAPS_FEATURE_SYS_MODULE: /* No kernel module loading */ > + case VIR_DOMAIN_CAPS_FEATURE_SYS_TIME: /* No changing the clock */ > + case VIR_DOMAIN_CAPS_FEATURE_MKNOD: /* No creating device nodes */ > + case VIR_DOMAIN_CAPS_FEATURE_AUDIT_CONTROL: /* No messing with auditing status */ > + case VIR_DOMAIN_CAPS_FEATURE_MAC_ADMIN: /* No messing with LSM config */ > + toDrop = (state != VIR_DOMAIN_FEATURE_STATE_ON); > + break; > + default: /* User specified capabilities to drop */ > + toDrop = (state == VIR_DOMAIN_FEATURE_STATE_OFF); > + } > + > + if (toDrop && (ret = capng_update(CAPNG_DROP, > + CAPNG_EFFECTIVE | CAPNG_PERMITTED | > + CAPNG_INHERITABLE | CAPNG_BOUNDING_SET, > + capsMapping[i])) < 0) { > + virReportError(VIR_ERR_INTERNAL_ERROR, > + _("Failed to remove capability %s: %d"), > + virDomainCapsFeatureTypeToString(i), ret); > + return -1; > + } So what I don't like about the current code is that the set of 5 capabilities we drop is essentially arbitrary and has no inherant value. - If user namespaces are not active, then the container is insecure even if we block those 5 caps, so they add no security benefit. - If user namespaces are active, then the container is secure even if we allow these 5 caps, so again they add no security benefit. If this were day 1, I'd just allow all possible capabilities by default but that would be a semantic change - particularly changing from block CAP_MKNOD to allowing CAP_MKNOD would cause a regression in any containers using systemd. My concern, however, is that a user who wants to run a container with absolutely everything blocked, has to list them all in the XML, and the kernel adds new capabilities over time. So they might have a config which is secure today where they've listed everything to drop, then the kernel adds a new capabilty and they become potentially less secure since their existing config doesn't know about the new capability name. Looking at the XML you've proposed below: > +<domain type='lxc'> > + <name>demo</name> > + <uuid>8369f1ac-7e46-e869-4ca5-759d51478066</uuid> > + <os> > + <type>exe</type> > + <init>/sh</init> > + </os> > + <features> > + <capabilities> > + <mknod state="on"/> > + </capabilities> > + </features> I think what's missing here is some notion of the initial policy. I think the <capabilities> element could usefully grow a new 'policy' attribute eg <features> <capabilities policy="default|deny|allow"> <mknod state="on"/> </capabilities> </features> Now this would mean: - default == the current historical hypervisor specific default behaviour. ie what libvirt LXC driver does today. can allow/deny any caps to change this default - deny == all CAPS blocked by default. Must whitelist any caps to be allowed in container - allow == all CAPS allowed by default. Must whitelist any caps to be blocked from container And the default policy would be 'default' for sake of back compat. The final thing that concerns me is that this list of caps names is pretty much Linux specific. eg Solaris caps will potentially have different names. I don't see a good way to deal with that other than to say if we ever need to support this on Solaris, we'll extend the list of caps names to be the union of all names used by Linux and Solaris, and only process the caps relevant to each OS. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list