Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2015-05-25 at 16:32 -0600, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 12:50:52AM +0300, Or Gerlitz wrote:
> > On Tue, May 26, 2015 at 12:14 AM, Jason Gunthorpe
> > <jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > On Mon, May 25, 2015 at 11:04:41PM +0300, Or Gerlitz wrote:
> > >
> > >> OK, so rewinding a bit, the IB VF [1] identity is their 8 bytes port
> > >> GUID, and as Jason noted the user/kernel API allows to deliver up to
> > >> 32 bytes between user and kernel under the set_vf_mac flow
> > >> (do_setvfinfo() in net/core/rtnetlink.c). Trying it out through
> > >> **non-modified** ip tool and net/core/rtnetlink.c things just work -
> > >> I can set eight bytes value to be the virtual port GUID :
> > >
> > > Was I not perfectly clear? You have to use the 20 byte LLADDR format here:
> > >
> > >> # ip link set dev ib0 vf 1 mac aa:bb:cc:dd:ee:ff:11:22
> > 
> > Jason,
> > 
> > I am aiming to provision the VF IB end-node address == port GUID (vGUID)
> > in the same manner that VF Eth end-node address is their MAC, not
> > more, not less.
> 
> I perfectly understand what you are trying to do.
> 
> I care about the design and consistency of netlink - and that means
> there is one LLADDR definition for a net device, and every single netlink
> message that touches a LLADDR uses that definition - for IPoIB that is 20
> bytes.

So, here's my $.02.

First, 8 bytes is the right size for the data to be transferred.  The
remaining 12 bytes of the IPoIB LLADDR are generated constructs at least
partially controlled by elements outside of this machine, and moreover,
they are *not* the *actual* LLADDR of the underlying IB device and the
20 byte LLADDR never hits the wire.  So, it's an artificial construct.

The fact that the 20 byte LLADDR of IPoIB is problematic is born out by
multiple facts:  we only use 8 bytes in DHCP requests and responses,
only use 8 bytes in matching IPoIB netdevices against HWADDR entries in
the sysconfig/ifcfg-* files, only use 8 bytes for matching the udev
rules to rename IPoIB devices, etc.

This clearly demonstrates that attaching the netlink ndo points to the
IPoIB netdevice is, let's just say, a shortcut.  The *real* device that
we should be manipulating is not the IPoIB device, but the verbs device.
If we were manipulating the verbs device, no one would be arguing for
anything other than the exact proper item: an 8 byte GUID.  Of course,
the problem then becomes an issue of "can we attach the underlying verbs
device to the netlink ndo mechanism?"  And the answer there is "Not
easily, and Dave Miller would likely throw a fit" (and rightfully so, as
all netdevice entries are assumed to be IP capable and we aren't, and so
we can't really be used by the core of the net stack).

Since I don't see us getting permission from Dave to attach verbs
devices to the netstack just so we can A) be ignored by the core of the
netstack while also B) having only 3 or 4 of the ndo_ entry points
defined so we can co-opt their infrastructure for our own uses.

That leaves using the IPoIB device.  In which case, here's what I want:

If a user passes in a MAC addresses, please use the following checks:

if (mac_len != 8 && mac_len != 20)
	err out
set guid as lower 8 bytes of mac, leave remainder untouched regardless
of what's passed in

You will note that if someone passes in a mac size of 8, this is not an
error.  It is consistent with the existing NetworkManager device
settings, ifup-ib device settings, and udev device settings that an 8
byte MAC for IPoIB devices is the right thing to uniquely identify the
device.

Here's why I disagree with Jason.

If you have an app that queries a device, sees that it has a 20byte MAC,
and then tries to set that MAC, it may have no specific knowledge of how
IPoIB relates to the underlying verbs device and could just be trying to
set a MAC blindly.  And that's ok, but we aren't going to honor the
entire MAC they pass in, and that's likely to confuse the app (or the
person running the app).

On the other hand, if we advertise a 20 byte MAC, and the app/person
passes in an 8 byte MAC instead, then that implicitly denotes that the
app/person *knows* this is an IPoIB device, and *knows* that we are
actually setting the underlying GUID of the verbs device, and is aware
of and prepared for the fact that the entire 20 byte MAC is not in fact
settable.  IPoIB and verbs devices are an exception to the rest of the
netdevice stack, and this usage tacitly acknowledges that fact.  So I am
of the opinion that if an app/user knows exactly how accessing the IPoIB
device is merely as a passthrough to the verbs device and that 8 bytes
is the right size, then we should accept it instead of requiring them to
supply fake data they know won't be used.

> To violate that design invariant needs an incredibly strong argument,
> and you haven't made it.

Hopefully my argument satisfies your requirements.

> This allows generic code to work with the LLADDR - for instance 'ip
> link set vf mac' should have checked the size of the IFLA_ADDRESS and
> demanded that the address argument is the same number of bytes. It is
> very broken the command happily accepts an 8 byte and 6 byte argument
> for the same device.
> 
> Yes, it is ugly that the PF side's ndo_get_vf_config cannot return the
> same 20 byte address of the VF's ipoib interface, but I think that is
> less ugly than forcing a different address format just for the vf calls.
> 
> If you have doubts then *ask netdev*. Ask them if ndo_set_vf_mac must
> follow the same address size and format as IFLA_ADDRESS, or if we can
> use something else.
> 
> Such a netlink architecture choice is beyond the authority of
> linux-rdma.
> 
> Jason


-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: 0E572FDD

Attachment: signature.asc
Description: This is a digitally signed message part


[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux