On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
>
>
> On 07/08/2018 02:01 AM, Martin Kletzander wrote:
> > On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
> >> On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@xxxxxxxxxx> wrote:
> >>
> >>> Hi,
> >>>
> >>> Opening tap devices, such as macvtap, that are created in containers is
> >>> problematic because the interface for opening tap devices is via
> >>> /dev/tapNN and devtmpfs is not typically mounted inside a container as
> >>> its not namespace aware. It is possible to do a mknod() in the
> >>> container, once the tap devices are created, however, since the tap
> >>> devices are created dynamically its not possible to apriori allow access
> >>> to certain major/minor numbers, since we don't know what these are going
> >>> to be. In addition, its desirable to not allow the mknod capability in
> >>> containers. This behavior, I think is somewhat inconsistent with the
> >>> tuntap driver where one can create tuntap devices inside a container by
> >>> first opening /dev/net/tun and then using them by supplying the tuntap
> >>> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> >>> network namespace, one is limited to opening network devices that belong
> >>> to your current network namespace.
> >>>
> >>> Here are some options to this issue, that I wanted to get feedback
> >>> about, and just wondering if anybody else has run into this.
> >>>
> >>> 1)
> >>>
> >>> Don't create the tap device, such as macvtap in the container. Instead,
> >>> create the tap device outside of the container and then move it into the
> >>> desired container network namespace. In addition, do a mknod() for the
> >>> corresponding /dev/tapNN device from outside the container before doing
> >>> chroot().
> >>>
> >>> This solution still doesn't allow tap devices to be created inside the
> >>> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> >>> a container, it would mean changing libvirtd to open existing tap
> >>> devices (as opposed to the current behavior of creating new ones). This
> >>> would not require any kernel changes, but as mentioned seems
> >>> inconsistent with the tuntap interface.
> >>>
> >>
> >> For KubeVirt, apart from how exactly the device ends up in the container,
> >> I would want to pursue a way where all network preparations which require
> >> privileges happens from a privileged process *outside* of the container.
> >> Like CNI solutions do it. They run outside, have privileges and then create
> >> devices in the right network/mount namespace or move them there. The final
> >> goal for KubeVirt is that our pod with the qemu process is completely
> >> unprivileged and privileged setup happens from outside.
> >>
> >> As a consequence, and depending on which route Dan pursues with the
> >> restructured libvirt, I would assume that either a privileged libvirtd-part
> >> outside of containers creates the devices by entering the right namespaces,
> >> or that libvirt in the container can consume pre-created tun/tap devices,
> >> like qemu.
> >>
> >
> > That would be nice, but as far as I understand there will always be a need
> > for some privileges if you want to use a tap device. It's nice that CNI does
> > that and all the containers can run unprivileged, but that's because they do
> > not open the tap device and they do not do any privileged operations on it.
> > But QEMU needs to. So the only way would be passing an opened fd to the
> > container or opening the tap device there and making the fd usable for one
> > process in the container. Is this already supported for some type of
> > containers in some way?
> >
> > Martin
>
> Hi,
>
> So another option here call it #3 is to pass open fds via unix sockets.
> If there are privileged operations that QEMU is trying to do with the fd
> though, how will opening it first and then passing it to an unprivileged
> QEMU address that? Is the opener doing those operations first?

From libvirt's POV, it would be preferable to be able to open the macvtap
device by name inside the container, rather than having to accept a
pre-opened FD from the application.

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
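
For readers unfamiliar with option #3 in the thread (passing open fds via
unix sockets), here is a minimal sketch of the mechanism, assuming Python 3.9
or later on a Unix host. It is not libvirt or QEMU code. The helper names
(send_open_fd, recv_open_fd, demo) are made up for illustration, and a
temporary file stands in for the tap device, since opening a real /dev/tapNN
needs the device to exist. It only shows the descriptor hand-off and does not
answer the question of who performs any privileged operations on the fd.

```python
import os
import socket
import tempfile


def send_open_fd(sock: socket.socket, fd: int) -> None:
    """Opener side: hand an already-open descriptor to the peer.

    (Hypothetical helper name; wraps the SCM_RIGHTS mechanism.)
    """
    # One byte of ordinary data is sent along with the descriptor.
    socket.send_fds(sock, [b"F"], [fd])


def recv_open_fd(sock: socket.socket) -> int:
    """Consumer side: receive exactly one descriptor from the peer."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    if msg != b"F" or len(fds) != 1:
        raise RuntimeError("expected exactly one file descriptor")
    return fds[0]


def demo() -> bytes:
    # The two ends stand in for a privileged opener outside the container
    # and an unprivileged consumer inside it (assumption for illustration).
    opener, consumer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with opener, consumer, tempfile.TemporaryFile() as f:
        # Stand-in for the tap device: a file the consumer never opens itself.
        f.write(b"tap-stand-in")
        f.flush()
        send_open_fd(opener, f.fileno())
        received = recv_open_fd(consumer)
        try:
            # The received descriptor refers to the same open file.
            return os.pread(received, 64, 0)
        finally:
            os.close(received)


if __name__ == "__main__":
    print(demo())
```

In a real deployment the two ends would live in different processes connected
through a named or inherited Unix socket; the consumer never needs permission
to open the device node itself, which is the property the thread is after.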