Re: opening tap devices that are created in a container

Martin Kletzander <mkletzan@xxxxxxxxxx> · Tue, 17 Jul 2018 13:45:57 +0200

[Not sure who got this message, but it probably didn't get anywhere due to one
mailserver, so resending to make sure]

On Mon, Jul 09, 2018 at 05:00:49PM -0400, Jason Baron wrote:
On 07/08/2018 02:01 AM, Martin Kletzander wrote:
On Thu, Jul 05, 2018 at 06:24:20PM +0200, Roman Mohr wrote:
On Thu, Jul 5, 2018 at 4:20 PM Jason Baron <jbaron@xxxxxxxxxx> wrote:

Hi,

Opening tap devices, such as macvtap, that are created in containers is
problematic because the interface for opening tap devices is via
/dev/tapNN, and devtmpfs is not typically mounted inside a container as
it's not namespace aware. It is possible to do a mknod() in the
container once the tap devices are created. However, since the tap
devices are created dynamically, it's not possible to allow access a
priori to certain major/minor numbers, since we don't know what these are
going to be. In addition, it's desirable to not allow the mknod capability
in containers. This behavior, I think, is somewhat inconsistent with the
tuntap driver where one can create tuntap devices inside a container by
first opening /dev/net/tun and then using them by supplying the tuntap
device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
network namespace, one is limited to opening network devices that belong
to your current network namespace.
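
The tuntap flow described above (open /dev/net/tun, then name the device
via TUNSETIFF) can be sketched as follows. This is a minimal Python sketch
and not code from libvirt or QEMU: the helper names build_ifreq and
open_tap are hypothetical, the constants are the Linux values from
<linux/if_tun.h>, and the 40-byte ifreq layout assumes a 64-bit Linux host.

```python
import fcntl
import os
import struct

# Linux constants from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(name, flags=IFF_TAP | IFF_NO_PI):
    """Pack a struct ifreq carrying an interface name and flags.

    Assumes the 64-bit Linux layout: 16-byte name plus a 24-byte union.
    """
    encoded = name.encode()
    if len(encoded) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    return struct.pack("16sH22x", encoded, flags)


def open_tap(name):
    """Open /dev/net/tun and attach the fd to the tap device `name`.

    The kernel looks the name up in the caller's network namespace.
    If no such device exists, the kernel tries to create one instead,
    which needs extra privileges.
    """
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name))
    except OSError:
        os.close(fd)
        raise
    return fd
```

With an existing device, open_tap("tap0") would return an fd that can be
read and written like any other; "tap0" is only an example name.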

Here are some options for this issue that I wanted to get feedback on.
I'm also wondering if anybody else has run into this.

1)

Don't create the tap device, such as macvtap in the container. Instead,
create the tap device outside of the container and then move it into the
desired container network namespace. In addition, do a mknod() for the
corresponding /dev/tapNN device from outside the container before doing
chroot().
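
The mknod() step in option 1 could look roughly like the sketch below, run
by a privileged process outside the container. It is only an illustration
under assumptions: the helper names are hypothetical, and it assumes the
caller already knows a sysfs "dev" attribute file holding the device's
"major:minor" pair; where that file lives is left to the caller.

```python
import os
import stat


def parse_dev_numbers(text):
    """Parse a sysfs-style "major:minor" string into two integers."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def make_tap_node(sysfs_dev_file, node_path):
    """Create a character device node for a tap device.

    sysfs_dev_file: path to a file containing "major:minor" (assumed input).
    node_path: where to create the node, e.g. a /dev/tapNN path inside the
    container's root. Requires the mknod capability, so it is meant to
    run outside the container.
    """
    with open(sysfs_dev_file) as f:
        major, minor = parse_dev_numbers(f.read())
    os.mknod(node_path, stat.S_IFCHR | 0o600, os.makedev(major, minor))
```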

This solution still doesn't allow tap devices to be created inside the
container. Thus, in the case of kubevirt, which runs libvirtd inside of
a container, it would mean changing libvirtd to open existing tap
devices (as opposed to the current behavior of creating new ones). This
would not require any kernel changes, but as mentioned seems
inconsistent with the tuntap interface.

For KubeVirt, apart from how exactly the device ends up in the container, I
would want to pursue a way where all network preparations which require
privileges happen from a privileged process *outside* of the container,
as CNI solutions do. They run outside, have privileges and then create
devices in the right network/mount namespace or move them there. The final
goal for KubeVirt is that our pod with the qemu process is completely
unprivileged and privileged setup happens from outside.

As a consequence, and depending on which route Dan pursues with the
restructured libvirt, I would assume that either a privileged libvirtd-part
outside of containers creates the devices by entering the right namespaces,
or that libvirt in the container can consume pre-created tun/tap devices,
like qemu.

That would be nice, but as far as I understand there will always be a need for
some privileges if you want to use a tap device.  It's nice that CNI does that
and all the containers can run unprivileged, but that's because they do not open
the tap device and they do not do any privileged operations on it.  But QEMU
needs to.  So the only way would be passing an opened fd to the container or
opening the tap device there and making the fd usable for one process in the
container.  Is this already supported for some type of containers in some way?

Martin

Hi,

So another option here, call it #3, is to pass open fds via unix sockets.
If there are privileged operations that QEMU is trying to do with the fd
though, how will opening it first and then passing it to an unprivileged
QEMU address that? Is the opener doing those operations first?
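
Passing an open fd over a unix socket, as in option #3, relies on
SCM_RIGHTS ancillary data. The following is a minimal Python sketch of that
mechanism only; the helper names send_fd and recv_fd are hypothetical, and
it says nothing about how libvirt or QEMU would wire this up.

```python
import array
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a connected AF_UNIX socket."""
    fds = array.array("i", [fd])
    # At least one byte of regular data must accompany the ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize)
    )
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial integer before decoding.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
            return fds[0]
    raise RuntimeError("no file descriptor received")
```

The receiver gets its own descriptor referring to the same open file, so
a privileged opener can do the open and hand the result to an unprivileged
process.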

Sorry for the confusion, but QEMU is not doing any privileged operations.  I got
confused by the fact that anyone can open and do a R/W on a tap device.  But it
looks like that's on purpose.  No capabilities are needed for opening
/dev/net/tun and calling ioctl(TUNSETIFF) with an existing name and then doing R/W
operations on it.  It just works.

Correct me if I'm wrong, but to sum it all up, the only things that we need to
figure out (which might possibly be solved by ideas in the other thread) are:

tap:
- Existence of /dev/net/tun
- Having permissions to open it (0666 by default, shouldn't be a big deal)
- Knowing the device name

macvtap:
- Existence of /dev/tapXX
- Having permissions to open /dev/tapXX
- One of the following:
 - Knowing the device name (and being able to translate it using a netlink socket)
 - Knowing the device index

The rest should be an implementation detail.
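
For the macvtap case, going from a device name to the /dev/tapXX path can
be sketched as below. This is a hedged Python illustration: the helper
names are hypothetical, and socket.if_nametoindex() stands in for the
netlink lookup mentioned in the list; it resolves the name in the caller's
network namespace.

```python
import socket


def macvtap_dev_path(ifindex):
    """Return the /dev/tapXX path for a given interface index."""
    if ifindex <= 0:
        raise ValueError("interface index must be positive")
    return f"/dev/tap{ifindex}"


def macvtap_dev_path_for_name(ifname):
    """Translate a device name to its index, then to /dev/tapXX.

    Raises OSError if no interface with that name is visible in the
    current network namespace.
    """
    return macvtap_dev_path(socket.if_nametoindex(ifname))
```

The device node still has to exist and be openable by the caller; this
only computes which path to open.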

Am I right?  Did I miss anything?