On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
> > On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> >> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> >> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >> > > <sridhar.samudrala@xxxxxxxxx> wrote:
> >> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> >> > > > act as a backup for another device with the same MAC address.
> >> > > >
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@xxxxxxxxx>
> >> > > > ---
> >> > > >  drivers/net/virtio_net.c        | 2 +-
> >> > > >  include/uapi/linux/virtio_net.h | 3 +++
> >> > > >  2 files changed, 4 insertions(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> > > > index 12dfc5fee58e..f149a160a8c5 100644
> >> > > > --- a/drivers/net/virtio_net.c
> >> > > > +++ b/drivers/net/virtio_net.c
> >> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >> > > >         VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >> > > >         VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >> > > >         VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >> > > > -       VIRTIO_NET_F_SPEED_DUPLEX
> >> > > > +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >> > > >
> >> > > >  static unsigned int features[] = {
> >> > > >         VIRTNET_FEATURES,
> >> > > >
> >> > > > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
> >> > > > --- a/include/uapi/linux/virtio_net.h
> >> > > > +++ b/include/uapi/linux/virtio_net.h
> >> > > > @@ -57,6 +57,9 @@
> >> > > >                                          * Steering */
> >> > > >  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >> > > >
> >> > > > +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> >> > > > +                                        * with the same MAC.
> >> > > > +                                        */
> >> > > >  #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> >> > > >
> >> > > >  #ifndef VIRTIO_NET_NO_LEGACY
> >> > >
> >> > > I'm not a huge fan of the name "backup" since that implies that the
> >> > > virtio interface is only used if the VF is not present, and there are
> >> > > multiple instances, such as dealing with east/west or
> >> > > broadcast/multicast traffic, where it may be desirable to use the
> >> > > para-virtual interface rather than deal with the PCI
> >> > > overhead/bottleneck to send the packet.
> >> >
> >> > Right now hypervisors mostly expect that yes, only one at a time is
> >> > used. E.g. if you try to do multicast sending packets on both VF and
> >> > virtio then you will end up with two copies of each packet.
> >>
> >> I think we want to use only one interface to send out any packet. In the
> >> case of broadcasts/multicasts it would be an optimization to send them
> >> via virtio, and this patch series adds that optimization.
> >
> > Right, that's what I think we should rather avoid for now.
> >
> > It's *not* an optimization if there's a single VM on this host,
> > or if a specific multicast group does not have any VMs on the same
> > host.
>
> Agreed. In my mind this is something that is controlled by the
> pass-thru interface once it is enslaved.

It would be pretty tricky to control through the PT interface, since a
PT interface pretends to be a physical device, which has no concept of
VMs.
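Stepping back to the patch itself for a second: on the guest side,
consuming a bit like this stays tiny. Just a sketch of what I'd expect
(not part of this patch; virtio_has_feature() is the existing helper,
the "backup" flag in virtnet_info is made up here):

static void virtnet_check_backup(struct virtio_device *vdev,
				 struct virtnet_info *vi)
{
	/* Sketch only: remember at probe time whether the host asked us
	 * to act as a backup for a pass-thru device with the same MAC.
	 */
	vi->backup = virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP);
}

Everything else (enslaving the VF, picking the Tx path) would hang off
that flag, presumably in the rest of the series.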
> > I'd rather we just sent everything out on the PT if that's there.
> > The reason we have virtio in the picture is just so we can migrate
> > without downtime.
>
> I wasn't saying we do that in all cases. That would be something that
> would have to be decided by the pass-thru interface. Ideally the
> virtio would provide just enough information to get itself into the
> bond, and I see this being the mechanism for it to do so. From there
> the complexity mostly lies in the pass-thru interface to configure the
> correct transmit modes if, for example, you have multiple pass-thru
> interfaces or a more complex traffic setup due to things like
> SwitchDev.
>
> In my mind we go the bonding route, and there are a few use cases for
> all of this. First is the backup case that is being addressed here.
> That becomes your basic "copy netvsc" approach, which would be the
> default. It is how we would handle basic pass-thru back-up paths. If
> the host decides to send multicast/broadcast traffic up through it,
> that is a host-side decision. I am okay with our default transmit
> behavior from the guest being to send everything through the pass-thru
> interface. All the virtio would be doing is advertising that it is a
> side channel for some sort of bond with another interface. By calling
> it a side channel it implies that it isn't the direct channel for the
> interface, but it is an alternative communications channel for things
> like backup, and other future uses.
>
> There end up being a few different "phase 2" concepts I have floating
> around which I want to avoid limiting. Solving the east/west problem
> is an example. In my mind that just becomes a bonding Tx mode that is
> controlled via the pass-thru interface. The VF could black-list the
> local addresses so that they instead fall into the virtio interface.
> In addition, I seem to recall watching a presentation from Mellanox
> where they were talking about a VF per NUMA node in a system, which
> would imply multiple VFs with the same MAC address. I'm not sure how
> the current patch handles that. Obviously that would require a more
> complex load-balancing setup. The bonding solution ends up being a way
> to resolve that, so that they could just have it take care of picking
> the right Tx queue based on the NUMA affinity and fall back to the
> virtio/netvsc when those fail.

The way I see it, we'd need to pass a bunch of extra information from
host to guest, and we'd have to use a PV interface for it. When we do
this, we'll need to have another feature bit, and we can call it
SIDE_CHANNEL or whatever.

> >> In the receive path, the broadcasts should only go to the PF and
> >> reach the VM via virtio, so that the VM doesn't see duplicate
> >> broadcasts.
> >>
> >>
> >> >
> >> > To me the east/west scenario looks like you want something
> >> > more similar to a bridge on top of the virtio/PT pair.
> >> >
> >> > So I suspect that use-case will need a separate configuration bit,
> >> > and possibly that's when you will want something more powerful,
> >> > such as a full bridge.
> >>
> >> East-west optimization of unicasts would be a harder problem to
> >> solve, as somehow the hypervisor needs to indicate the local
> >> destination MACs to the VM.
> >
> > Using a bridge with a dedicated device for east/west would let the
> > bridge use standard learning techniques for that, perhaps?
>
> I'm not sure about having the guest have to learn.

It's certainly a way to do this, but certainly not the only one.

> In my mind the VF should know who is local and who isn't.

Right. But note that these things change.
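Just so we're talking about the same thing: the Tx selection you're
describing boils down to a lookup keyed on the destination, roughly
like the sketch below. Purely illustrative, all names are made up here,
and where the "local" list comes from is exactly the open question:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct tx_policy {
	uint8_t local_macs[16][6];  /* destinations known to be on this host */
	int     nr_local;
	int     slave_vf;           /* handle for the pass-thru (VF) path */
	int     slave_virtio;       /* handle for the virtio path */
	bool    vf_up;              /* VF present and usable? */
};

static bool is_local_dest(const struct tx_policy *p, const uint8_t *dmac)
{
	for (int i = 0; i < p->nr_local; i++)
		if (memcmp(p->local_macs[i], dmac, 6) == 0)
			return true;
	return false;
}

/* Pick which slave a frame with destination MAC dmac goes out on. */
static int pick_tx_slave(const struct tx_policy *p, const uint8_t *dmac)
{
	if (!p->vf_up)                  /* backup case: VF hot-unplugged */
		return p->slave_virtio;
	if (is_local_dest(p, dmac))     /* east/west: stay on the host */
		return p->slave_virtio;
	return p->slave_vf;             /* everything else: pass-thru */
}

Whether local_macs is populated by learning, by SwitchDev/port
representor data, or by something the host pushes down is the policy
decision we keep going back and forth on.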
> In the case of SwitchDev it should be possible for the port
> representors and the switch to provide data on which interfaces are
> bonded on the host side and which aren't. With that data it would be
> pretty easy to just put together a list of addresses that would prefer
> to go the para-virtual route instead of being transmitted through
> physical hardware.
>
> In addition, a bridge implies much more overhead, since normally a
> bridge can receive a packet in on one interface and transmit it on
> another. We don't really need that. This is more of a VEPA type setup
> and doesn't need to be anything all that complex. You could probably
> even handle the Tx queue selection via a simple eBPF program and map,
> since the input for whatever is used to select Tx should be pretty
> simple (destination MAC, source NUMA node, etc.), and the data set
> shouldn't be too large.

That sounds interesting. A separate device might make this kind of
setup a bit easier. Sridhar, did you look into creating a separate
device for the virtual bond at all? It does not have to be in a
separate module, that kind of refactoring can come later, but once we
commit to using the same single device as virtio, we can't change that.

> >> > > What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
> >> > > is a bit of double entendre, as we are using the physical MAC address
> >> > > to provide configuration information, and then in addition this
> >> > > interface acts as a secondary channel for passing frames to and from
> >> > > the guest rather than just using the VF.
> >> > >
> >> > > Just a thought.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> >
> >> > I just feel that's a very generic name, not conveying enough information
> >> > about how the hypervisor expects the pair of devices to be used.
>
> I would argue that BACKUP is too specific. I can see many other uses
> other than just being a backup interface

True, but the ones you listed above all seem to require virtio changes
anyway; we'll be able to add a new feature or rename this one as we
make them.

> and I don't want us to box ourselves in. I agree that it makes sense
> for active/backup to be the base mode, but I really don't think it
> conveys all of the possibilities, since I see this as possibly being
> able to solve multiple issues as this evolves. I'm also open to other
> suggestions. We could go with PASSTHRU_SIDE_CHANNEL for instance, I
> just didn't suggest that since it seemed kind of wordy.

There's only so much info a single bit can confer :) So we can always
rename later; the point is to draw the line and say "this is the
functionality current hosts expect".

--
MST

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization