On Mon, Jan 25, 2010 at 12:47:17PM -0500, Stefan Berger wrote: > Hello! > > The attached patch provides support for the Linux macvtap device for > Qemu by passing a file descriptor to Qemu command line similar to how it > is done with a regular tap device. I have modified the network XML code > to understand a definition as the following one here: > > <network> > <name>vepanet</name> > <uuid>4ebd5168-6321-4757-8397-f6e83484f402</uuid> > <extbridge mode='vepa' dev='eth0'/> > </network> I don't think this is the correct place to be adding this kind of configuration / functionality. The virNetworkPtr / <network> XML is describing a virtual network capability which is *not* directly connected to the LAN. It may be configured to route from the virtual network to the LAN, with optional NAT applied. So while the implementation may use a bridge device, this bridge is not connected to any physical device. Since VEPA is about directly connecting VMs to the LAN, this doesn't really fit here. I'm not sure how familiar people are with the current libvirt object model, so here's a 30 second overview - virDomainPtr / <domain> - management & configuration of the virtual machines & their hardware. - virNetworkPtr / <network> - virtual networks, either isolated or routed (+NAT) to the LAN - virInterfacePtr / <interface> - host interface configuration for physical devices, bridge, bonding & vlan devices - virNodeDevPtr / <device> - host device enumeration / discovery One warning.. - The <domain> XML also contains a element called <interface> which declares a guest NIC. Be careful not to mix up with the similarly named virInterfacePtr XML also using <interface> refering to a host NIC. In the context of bridging a guest to a plain ethernet device, these fit together as follows 1. The virNodeDevPtr APIs are used to discover what physical network devices exist, 'eth0' 2. The virInterfacePtr APIs are used to create a bridge on the host br0, containing the physical device 'eth0' 3. The virDomainPtr APIs are used to define a guest configuration using guest NIC <interface type='bridge'> element, whcih uses a plain TAP device. The logical setup of devices is /-> tap0 -> vm0 pcidev0 -> eth0 -> br0 --> tap1 -> vm1 \-> tap2 -> vm2 Now, throw SR-IOV into the picture.... With SR-IOV there is a single PCI card, with multiple virtual functions. The number of virtual functions is fixed at time of kernel module load so there's no live configuration available from userspace. Every virtual function shows up as a new PCI device in the virNodeDevPtr APIs, and each of those has an associated ethX device visible too. So with SR-IOV plus traditional bridging the same 3 step process above more or less applies unchanged, creating a bridge for each SR-IOV virtual function ethX device & then attaching guests using TAP. The logical setup of devices is a little different pcidev0 -> eth0 -> br0 -> tap0 -> vm0 | pcidev1 -> eth1 -> br1 -> tap1 -> vm1 | pcidev2 -> eth2 -> br2 -> tap2 -> vm2 and you can correlate pcidev0/1/2 to determine they are all virtual function. Of course there's no reason this has to restrict itself to a 1 bridge == 1 guest arrangement - i've just kept this diagram simple. > > > This XML indicates the device that links to the external bridge, here > eth0, and the mode this device is supposed to be set into, here 'vepa'. > If 'extbridge' is found in the XML, all other XML elements that have > been used so far, i.e., bridge, ip, dhcp, etc., are ignored. > The above network description can then be referenced from the virtual > machine definition using the typical interface description of type > 'network': > > <interface type='network'> > <source network='vepanet'/> > <model type='virtio'/> > </interface> > > > The above network XML holds the necessary parameters to open a macvtap > device. Using command line, one would issue the following command to > achieve the same what now libvirt does internally, using a patched > version of the 'ip' command: > > ip link add link eth0 <optional name of if> type macvtap mode vepa > > This then creates the network interface, i.e., macvtap12, along with > entries in /sys/class/net/macvtap12/ where the content of the ifindex > file returns the integer necessary to build the /dev/tap%d device to > open the file descriptor of and pass to qemu via command line argument. So every guest VM NIC gets a new uniqely named macvtap device, and and unqiue TAP device ? IIUC, with a macvlan + a regular NIC we'd thus be seeing something like this arrangement .. /-> macvtap0 -> tap0 -> vm0 pcidev0 -> eth0 --> macvtap1 -> tap1 -> vm1 \-> macvtap2 -> tap2 -> vm2 I guess you could also use this with an SR-IOV card too pcidev0 -> eth0 -> macvtap0 -> tap0 -> vm0 | pcidev1 -> eth1 -> macvtap1 -> tap1 -> vm1 | pcidev2 -> eth2 -> macvtap2 -> tap2 -> vm2 Also, again there's no reason why we'd need to restrict ourselves to only using a single macvtap device with each ethX device / virtual function. The core question to me is whether we need to introduce any new configuration parameters at the host level. It looks to me like it can all be done against the guest XML <interface> element. Specifically, I don't see a need to manage the macvtapN devices directly via the virInterfacePtr APIs. This is good, because that is all delegated to the netcf library, which in turn is just a way to read/write the /etc/sysconfig/networking-scripts. These ifcfg-XXX files have no knowledge of macvtap, and so plumbing all that through the networking-scripts, netcf and libvirt virInterfacePtr APIs is non-trivial work we'd be better off avoiding unless there was a clear compelling need. So unless I'm missing something major in my reasoning here I think in the domain XML we end up with two possible configs for guest network interfaces 1. The current one using plain Linux software bridging, which we can't change in an incompatible way <interface type='bridge'/> <source bridge='br0'/> <target dev='vnet0'/> </interface> Here, the source device is a bridge previously setup to have a physical device enslaved (regular or SR-IOV) The target device is the plain TAP device 2. A new one using hardware bridging, which we can freely define for our new needs <interface type='direct'/> <source dev='eth0' mode='vepa|pepa|bridge'/> <target dev='vnet0'/> </interface> Here, source device is a physical device (regular or SR-IOV). The target device is a macvtap device. In both cases the TAP or macvtap device is created on the fly when the VM is booted & destroyed at shutdown (either by the kernel, or manually by libvirt for macvtap). Both configs are ultimately a form of 'bridging', the difference is that the former is focused on software bridges, as explicitly managed objects separate from the physical device, while the latter is hardware bridges which don't need to be managed separately. > Some details: > > In libvirt I am first searching for an unused interface name following > the template macvtap%d unless the user has provided a name explicitly > or the previous macvtap%d is now taken by another VM. Once that has > succeeded, I follow the path through the filesystem to open the > corresponding /dev/tap%d. > Unlike the regular tap device, where the network interface disappears > once the corresponding file descriptor is closed, the macvtap device > needs explicit tear-down. So, when a VM terminates, I am deleting all > macvtap type interface with the MAC address as the interface of the > terminating VM. Ok, this plan all makes sense. The only thing is that if we want to allow the user to manually specify a macvtap name, we should declare that they are not allowed to use the prefix 'macvtap' for their name. We do similar with TAP, where we declare 'vnet' is a reserved prefix for auto-generated names. > Index: libvirt/src/qemu/qemu_conf.c > =================================================================== > --- libvirt.orig/src/qemu/qemu_conf.c > +++ libvirt/src/qemu/qemu_conf.c > @@ -52,6 +52,7 @@ > #include "nodeinfo.h" > #include "logging.h" > #include "network.h" > +#include "macvtap.h" > #include "cpu/cpu.h" > > #define VIR_FROM_THIS VIR_FROM_QEMU > @@ -1384,6 +1385,34 @@ int qemudExtractVersion(virConnectPtr co > } > > > +static int > +qemudPhysIfaceConnect(virConnectPtr conn, > + virDomainNetDefPtr net, > + char *linkdev, > + char *brmode) > +{ > + int rc; > +#if defined(WITH_MACVTAP) > + char *res_ifname = NULL; > + delMacvtapByMACAddress(conn, net->mac); > + rc = openMacvtapTap(conn, net->ifname, net->mac, linkdev, brmode, > + &res_ifname); > + if (rc > 0) { > + VIR_FREE(net->ifname); > + net->ifname = res_ifname; > + } > +#else > + (void)net; > + (void)linkdev; > + (void)brmode; > + qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR, > + "%s", _("No support for macvtap device")); > + rc = -1; > +#endif > + return rc; > +} > + > + > int > qemudNetworkIfaceConnect(virConnectPtr conn, > struct qemud_driver *driver, > @@ -1395,6 +1424,7 @@ qemudNetworkIfaceConnect(virConnectPtr c > int tapfd = -1; > int vnet_hdr = 0; > int template_ifname = 0; > + char *brmode = NULL; > > if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK) { > virNetworkPtr network = virNetworkLookupByName(conn, > @@ -1402,6 +1432,15 @@ qemudNetworkIfaceConnect(virConnectPtr c > if (!network) > return -1; > > + if (virNetworkGetExtbridgeData(network, &brname, &brmode) == 0) { > + tapfd = qemudPhysIfaceConnect(conn, net, > + brname, > + brmode); > + VIR_FREE(brname); > + VIR_FREE(brmode); > + return tapfd; > + } > + > brname = virNetworkGetBridgeName(network); > > virNetworkFree(network); As mentioned earlier, we don't want to touch the VIR_DOMAIN_NET_TYPE_NETWORK case, since that's not bridging - its separate virtal networks using NAT. If my idea above is acceptable, we'd be adding a new VIR_DOMAIN_NET_TYPE_DIRECT that we'd hook up. > Index: libvirt/src/util/macvtap.h > =================================================================== > --- /dev/null > +++ libvirt/src/util/macvtap.h > @@ -0,0 +1,48 @@ > +/* > + * Copyright (C) 2010 IBM Corporation > + * > + * This library is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * This library is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this library; if not, write to the Free Software > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA > + * > + * Authors: > + * Stefan Berger <stefanb@xxxxxxxxxx> > + */ > + > +#ifndef __UTIL_MACVTAP_H__ > +#define __UTIL_MACVTAP_H__ > + > +#include <config.h> > + > +#if defined(WITH_MACVTAP) > + > +#include "internal.h" > + > +int openMacvtapTap(virConnectPtr conn, > + const char *ifname, > + const unsigned char *macaddress, > + const char *linkdev, > + const char *mode, > + char **res_ifname); > + > +int delMacvtapByMACAddress(virConnectPtr conn, > + const unsigned char *macaddress); As an interface for internal usage, this looks like a fine start. > + > +#endif /* WITH_MACVTAP */ > + > +#define MACVTAP_MODE_PRIVATE_STR "private" > +#define MACVTAP_MODE_VEPA_STR "vepa" > +#define MACVTAP_MODE_BRIDGE_STR "bridge" > + > + > +#endif > Index: libvirt/src/Makefile.am > =================================================================== > --- libvirt.orig/src/Makefile.am > +++ libvirt/src/Makefile.am > @@ -55,6 +55,7 @@ UTIL_SOURCES = \ > util/ebtables.c util/ebtables.h \ > util/json.c util/json.h \ > util/logging.c util/logging.h \ > + util/macvtap.c util/macvtap.h \ > util/memory.c util/memory.h \ > util/pci.c util/pci.h \ > util/processinfo.c util/processinfo.h \ > @@ -784,12 +785,15 @@ if WITH_LINUX > USED_SYM_FILES += libvirt_linux.syms > endif > > +USED_SYM_FILES += libvirt_macvtap.syms > + > EXTRA_DIST += \ > libvirt_public.syms \ > libvirt_private.syms \ > libvirt_driver_modules.syms \ > libvirt_bridge.syms \ > - libvirt_linux.syms > + libvirt_linux.syms \ > + libvirt_macvtap.syms > > BUILT_SOURCES = libvirt.syms > > Index: libvirt/src/util/macvtap.c > =================================================================== > --- /dev/null > +++ libvirt/src/util/macvtap.c > @@ -0,0 +1,664 @@ > +/* > + * Copyright (C) 2010 IBM Corporation > + * > + * This library is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * This library is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with this library; if not, write to the Free Software > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA > + * > + * Authors: > + * Stefan Berger <stefanb@xxxxxxxxxx> > + */ > + > +#include <config.h> > + > +#if defined(WITH_MACVTAP) [snip]. I've not had time to look at the details of this macvtap.c code yet, but I assume its doing all you need :-) Is there any benefit to using the network libnl.so library, rather than the ioctl()'s directly ? I'm not too familiar with either, but we already use libnl.so indirectly via netcf, so don't be afraid of adding a dependancy on the libnl.so library if you consider it to be useful. > Index: libvirt/src/conf/network_conf.c > =================================================================== > --- libvirt.orig/src/conf/network_conf.c > +++ libvirt/src/conf/network_conf.c This file shouldn't get any changes really. > Index: libvirt/src/libvirt.c > =================================================================== > --- libvirt.orig/src/libvirt.c > +++ libvirt/src/libvirt.c > @@ -6034,6 +6034,48 @@ error: > return NULL; > } > > + > +/** > + * virNetworkGetExtbridgeData: > + * @network: a network object > + * @linkdev : pointer where name of the interface to connect to the external > + * bridge is returned > + * @brmode : pointer where mode of the external bridge is returned > + * > + * Returns 0 in case an external bridge has been configured, -1 otherwise > + */ > +int > +virNetworkGetExtbridgeData(virNetworkPtr network, > + char **linkdev, char **brmode) > +{ > + virConnectPtr conn; > + DEBUG("network=%p", network); > + > + virResetLastError(); > + > + if (!VIR_IS_CONNECTED_NETWORK(network)) { > + virLibNetworkError(NULL, VIR_ERR_INVALID_NETWORK, __FUNCTION__); > + virDispatchError(NULL); > + return -1; > + } > + > + conn = network->conn; > + > + if (conn->networkDriver && conn->networkDriver->networkGetExtbridgeData) { > + int ret; > + ret = conn->networkDriver->networkGetExtbridgeData(network, > + linkdev, > + brmode); > + return ret; > + } > + > + virLibConnError (conn, VIR_ERR_NO_SUPPORT, __FUNCTION__); > + > + virDispatchError(network->conn); > + return -1; > +} Also don't think we need to be adding extra public APIs for this, if we do it within scope of the domain XML. For the next submission, it'd be very helpful for review if you could split it into 3 patches 1. Adding the extra domain XML syntax to src/conf/domain_conf.{h,c} and docs/schemas/domain.rng 2. Adding the macvtap helper code in src/util/macvtap.{h,c} 3. Implementation for the QEMU driver to use the stuff from the first two patches. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list