> On 21 Mar 2019, at 10:58, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>
> On Thu, Mar 21, 2019 at 12:19:22AM +0200, Liran Alon wrote:
>>
>>
>>> On 21 Mar 2019, at 0:10, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>
>>> On Wed, Mar 20, 2019 at 11:43:41PM +0200, Liran Alon wrote:
>>>>
>>>>
>>>>> On 20 Mar 2019, at 16:09, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Wed, Mar 20, 2019 at 02:23:36PM +0200, Liran Alon wrote:
>>>>>>
>>>>>>
>>>>>>> On 20 Mar 2019, at 12:25, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On Wed, Mar 20, 2019 at 01:25:58AM +0200, Liran Alon wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 19 Mar 2019, at 23:19, Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Mar 19, 2019 at 08:46:47AM -0700, Stephen Hemminger wrote:
>>>>>>>>>> On Tue, 19 Mar 2019 14:38:06 +0200
>>>>>>>>>> Liran Alon <liran.alon@xxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> b.3) cloud-init: If configured to perform network configuration, it attempts to configure all available netdevs. It should, however, avoid doing so on net-failover slaves.
>>>>>>>>>>> (Microsoft has handled this by adding a mechanism in cloud-init to blacklist a netdev from being configured in case it is owned by a specific PCI driver. Specifically, they blacklist the Mellanox VF driver. However, this technique doesn't work for the net-failover mechanism because both the net-failover netdev and the virtio-net netdev are owned by the virtio-net PCI driver).
>>>>>>>>>>
>>>>>>>>>> Cloud-init should really just ignore all devices that have a master device.
>>>>>>>>>> That would have been more general, and safer for other use cases.
>>>>>>>>>
>>>>>>>>> Given that lots of userspace doesn't do this, I wonder whether it would be
>>>>>>>>> safer to just somehow pretend to userspace that the slave links are
>>>>>>>>> down? And add a special attribute for the actual link state.
>>>>>>>>
>>>>>>>> I think this may be problematic as it would also break the legitimate use case
>>>>>>>> of userspace attempting to set various config on the VF slave.
>>>>>>>> In general, lying to userspace usually leads to problems.
>>>>>>>
>>>>>>> I hear you on this. So how about instead of lying,
>>>>>>> we basically just fail some accesses to slaves
>>>>>>> unless a flag is set, e.g. in ethtool.
>>>>>>>
>>>>>>> Some userspace will need to change to set it, but in a minor way.
>>>>>>> Arguably/hopefully failure to set config would generally be a safer
>>>>>>> failure.
>>>>>>
>>>>>> Once userspace sets this new flag via ethtool, all operations done by other userspace components will still work.
>>>>>
>>>>> Sorry about being unclear, the idea would be to require the flag on each ethtool operation.
>>>>
>>>> Oh. I have indeed misunderstood your previous email then. :)
>>>> Thanks for clarifying.
>>>>
>>>>>
>>>>>> E.g. Running dhclient without parameters, after this flag was set, will still attempt to perform DHCP on it and will now succeed.
>>>>>
>>>>> I think sending/receiving should probably just fail unconditionally.
>>>>
>>>> You mean that you wish the kernel to somehow prevent Tx on a net-failover slave netdev
>>>> unless the skb is marked with some flag to indicate it has been sent via the net-failover master?
>>>
>>> We can maybe avoid binding a protocol socket to the device?
>>
>> That is indeed another possibility that would work to avoid the DHCP issues.
>> And it will still allow checking connectivity. So it is better.
>> However, I still think it provides a non-intuitive customer experience.
>> In addition, I also want to take into account that most customers expect a 1:1 mapping between a vNIC and a netdev.
>> i.e. A cloud instance should show one netdev if it has one vNIC attached to it.
>> Customers usually don't care how they get accelerated networking. They just care that they do.
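
To make the "ignore all devices that have a master device" suggestion above concrete, here is a minimal sketch (illustrative only, not from this thread; function names are mine, not cloud-init's) of the check a tool like cloud-init could do, assuming the standard sysfs layout where the kernel creates a "master" symlink for enslaved netdevs:

import os

SYSFS_NET = "/sys/class/net"

def has_master(iface):
    # Slave netdevs (bonding, bridge, team, net_failover) expose a
    # "master" symlink in sysfs pointing at their controlling device.
    return os.path.islink(os.path.join(SYSFS_NET, iface, "master"))

def configurable_netdevs():
    # Interfaces a tool like cloud-init could safely configure:
    # skip loopback and anything enslaved to a master.
    return [iface for iface in os.listdir(SYSFS_NET)
            if iface != "lo" and not has_master(iface)]

This is driver-agnostic, so unlike the PCI-driver blacklist approach it would also cover the net-failover case where both the master and the standby slave are owned by the virtio-net driver.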
>>
>>>
>>>> This indeed resolves the group of userspace issues around performing DHCP on net-failover slaves directly (by dracut/initramfs, dhclient, etc.).
>>>>
>>>> However, I see a couple of down-sides to it:
>>>> 1) It doesn't resolve all userspace issues listed in this email thread. For example, cloud-init will still attempt to perform network config on net-failover slaves.
>>>> It also doesn't help with regard to Ubuntu's netplan issue that creates udev rules that match only by MAC.
>>>
>>>
>>> How about we fail to retrieve the MAC from the slave?
>>
>> That would work, but I think it is cleaner to just not pair the PV and VF netdevs based on them having the same MAC.
>
> There's a reference to that under "Non-MAC based pairing".
>
> I'll look into making it more explicit.

Yes, I know. I was referring to what you described in that section.

>
>>>
>>>> 2) It brings a non-intuitive customer experience. For example, a customer may attempt to analyse a connectivity issue by checking the connectivity
>>>> on a net-failover slave (e.g. the VF), but will see no connectivity, when in fact checking the connectivity on the net-failover master netdev shows correct connectivity.
>>>>
>>>> The set of changes I envision to fix our issues is:
>>>> 1) Hide net-failover slaves in a different netns created and managed by the kernel, which the user can still enter in order to manage the netdevs there if they explicitly wish to do so.
>>>> (E.g. to configure the net-failover VF slave in some special way.)
>>>> 2) Match the virtio-net and the VF based on a PV attribute instead of the MAC (similar to what is done in NetVSC). E.g. provide a virtio-net interface to get the PCI slot where the matching VF will be hot-plugged by the hypervisor.
>>>> 3) Have an explicit virtio-net control message to command the hypervisor to switch the data-path from virtio-net to the VF and vice-versa, instead of relying on intercepting the PCI master enable-bit
>>>> as an indicator of when the VF is about to be set up (similar to what is done in NetVSC).
>>>>
>>>> Is there any clear issue we see regarding the above suggestion?
>>>>
>>>> -Liran
>>>
>>> The issue would be this: how do we avoid conflicting with namespaces
>>> created by users?
>>
>> This is kinda controversial, but maybe separate netns names into 2 groups: hidden and normal.
>> To reference a hidden netns, you need to do it explicitly.
>> Hidden and normal netns names can collide, as they will be maintained in different namespaces (yes, I'm overloading the term namespace here...).
>
> Maybe it's an unnamed namespace. Hidden until userspace gives it a name?

This is also a good idea that would solve the issue. Yes.

>
>> Does this seem reasonable?
>>
>> -Liran
>
> Reasonable I'd say yes, easy to implement probably no. But maybe I
> missed a trick or two.

BTW, from a practical point of view, I think that even until we figure out how to implement this, it would be better to have a kernel auto-generated netns name (e.g. "kernel_net_failover_slaves") that would break only the rare userspace workload whose netns happens to collide with that name, rather than the breakage we have today across the various userspace components.

-Liran
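
On point 2 of the proposal above (pairing by PCI slot instead of by MAC, as NetVSC does), here is a minimal sketch of the sysfs lookup such a pairing would compare against; the PV attribute that would advertise the expected slot is hypothetical and does not exist in virtio-net today:

import os

def bus_address(iface):
    # The sysfs "device" symlink points at the bus device backing a netdev:
    # for a VF this is its PCI address (e.g. "0000:00:07.0"); for virtio-net
    # it is the virtio device (e.g. "virtio0"), whose parent is the PCI function.
    dev_link = "/sys/class/net/%s/device" % iface
    if not os.path.islink(dev_link):
        return None  # software-only interface, no bus parent
    return os.path.basename(os.readlink(dev_link))

With a PV attribute advertising the expected VF slot, the failover master would enslave whichever netdev shows up at that address, rather than whichever one happens to carry the same MAC.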
>
>>>
>>>>>
>>>>>> Therefore, this proposal just effectively delays when the net-failover slave can be operated on by userspace.
>>>>>> But what we actually want is to never allow a net-failover slave to be operated on by userspace unless userspace explicitly states
>>>>>> that it wishes to perform a set of actions on the net-failover slave.
>>>>>>
>>>>>> That would be achieved if, for example, the net-failover slaves were in a different netns than the default netns.
>>>>>> This also aligns with the expected customer experience that most customers just want to see a 1:1 mapping between a vNIC and a visible netdev.
>>>>>> But of course maybe there are other ideas that can achieve similar behaviour.
>>>>>>
>>>>>> -Liran
>>>>>>
>>>>>>>
>>>>>>> Which things to fail? Probably sending/receiving packets? Getting the MAC?
>>>>>>> More?
>>>>>>>
>>>>>>>> If we reach
>>>>>>>> a scenario where we try to avoid userspace issues generically and
>>>>>>>> not on a per-userspace-component basis, I believe the right path should be
>>>>>>>> to hide the net-failover slaves such that explicit action is required
>>>>>>>> to actually manipulate them (as described in the blog post). E.g.
>>>>>>>> automatically move net-failover slaves by the kernel to a different netns.
>>>>>>>>
>>>>>>>> -Liran
>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> MST