Re: [PATCH rdma-next 5/5] RDMA/core: Add command to set ib_core device net namspace sharing mode

Doug Ledford <dledford@xxxxxxxxxx> · Wed, 20 Feb 2019 13:42:14 -0500

On Wed, 2019-02-20 at 18:03 +0000, Parav Pandit wrote:
> > -----Original Message-----
> > From: Jason Gunthorpe
> > Sent: Wednesday, February 20, 2019 11:56 AM
> > To: Parav Pandit <parav@xxxxxxxxxxxx>
> > Cc: Doug Ledford <dledford@xxxxxxxxxx>; Leon Romanovsky
> > <leon@xxxxxxxxxx>; Leon Romanovsky <leonro@xxxxxxxxxxxx>; RDMA
> > mailing list <linux-rdma@xxxxxxxxxxxxxxx>
> > Subject: Re: [PATCH rdma-next 5/5] RDMA/core: Add command to set
> > ib_core device net namspace sharing mode
> > 
> > On Wed, Feb 20, 2019 at 10:52:16AM -0700, Parav Pandit wrote:
> > 
> > > Yes. we have the module parameter option in this series.
> > > I came across a user who didn't have LOM nics.
> > > They are directly using rdma nics in their cluster as primary and only
> > > interface.
> > 
> > This is very common for IB clusters, a dedicated ethernet management
> > network is a very expensive component at large scale.
> > 
> > > I do not know if such IB based networks exist.  And if they do, when
> > > they change this mode, they will have connectivity loss.
> > 
> > Or they have to change modes before setting up ipoib. It is much less useful.
> > 
> > > So we probably shouldn't be doing client unregister-register sequence
> > > as part of this sys operation done by advance user.
> > 
> > Provide a 'rdma ulp-restart' netlink command that does the enable/disable
> > sequence?
> > 
> Probably we should define more generic rdma dev up/down (start/stop) API that network manager sw can consume in sw.

I was thinking more along the lines of trying to change the compat dev
structure.  Right now, it only contains enough data for sysfs entries
and port attributes, but actual file opens go to the parent device.  If
you changed that, and created a full alias device, then you could change
the logic like so:

For rdma_dev_access_netns:
	return (net_eq(read_pnet(&dev->rdma_net), net) &&
		(!(dev->flags & INIT_NET_COPY) ||
		  ib_device_shared_netns));

So now you either have to have shared netns on and be attempting the
open via a default shared device in that namespace, or you have to be
opening a non-default, namespace-specific device for this namespace.
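
To make the intent of that check concrete, here is a minimal userspace
sketch of it. The struct layouts, the flag value, and the pointer
comparison standing in for net_eq()/read_pnet() are all assumptions for
illustration, not the real ib_core definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct net; not the kernel definition. */
struct net { int id; };

/* Assumed flag: this dev is a shared copy of an init_net device. */
#define INIT_NET_COPY 0x1u

/* Hypothetical alias device: only the fields the check needs. */
struct compat_dev {
	const struct net *rdma_net;	/* namespace this dev lives in */
	unsigned int flags;
};

/* Stands in for the global sharing-mode switch discussed here. */
static bool ib_device_shared_netns = true;

static bool rdma_dev_access_netns(const struct compat_dev *dev,
				  const struct net *net)
{
	assert(dev && net);

	/* The device must belong to the caller's namespace ... */
	if (dev->rdma_net != net)
		return false;

	/* ... and shared copies are only usable while sharing is on. */
	return !(dev->flags & INIT_NET_COPY) || ib_device_shared_netns;
}
```

With sharing on, both kinds of device in the caller's namespace pass.
With sharing off, only the devices without INIT_NET_COPY still pass.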

Then, when you call the netlink command with shared mode off, but not
with disconnect, all we do is unset ib_device_shared_netns and people
will no longer be able to connect via a non-init_net namespace to any of
the INIT_NET_COPY devices.

When you call the netlink command with shared mode off, and with
disconnect true, then we unset ib_device_shared_netns and we also go
through and delete all of the devs with INIT_NET_COPY in their flags. 
Those devs need to be how the processes opened the namespace device, and
we need to track enough stuff in those devs that we can pass that dev to
the normal destroy function for ib devices and let it tear it down like
it would a real device, taking all of the opens, pds, mrs, and
everything else right along with it.
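
Roughly, the two netlink cases could look like the sketch below. This
is plain userspace C under stated assumptions: the list type, the
shared_netns variable, and destroy_alias_dev() are hypothetical
placeholders, and the real destroy path would have to tear down opens,
pds, mrs, and so on rather than just free memory:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed flag: this dev is a shared copy of an init_net device. */
#define INIT_NET_COPY 0x1u

/* Hypothetical alias device kept on a simple singly linked list. */
struct alias_dev {
	unsigned int flags;
	struct alias_dev *next;
};

/* Stands in for ib_device_shared_netns. */
static bool shared_netns = true;

/* Placeholder for the normal ib device destroy path. */
static void destroy_alias_dev(struct alias_dev *dev)
{
	free(dev);
}

/*
 * Turn shared mode off. Without disconnect, only the flag changes.
 * With disconnect, every INIT_NET_COPY dev is also unlinked and
 * destroyed; namespace-specific devs are left alone.
 * Returns the number of devices torn down.
 */
static size_t shared_mode_off(struct alias_dev **head, bool disconnect)
{
	size_t destroyed = 0;

	assert(head != NULL);
	shared_netns = false;
	if (!disconnect)
		return 0;

	while (*head) {
		struct alias_dev *dev = *head;

		if (dev->flags & INIT_NET_COPY) {
			*head = dev->next;	/* unlink, then tear down */
			destroy_alias_dev(dev);
			destroyed++;
		} else {
			head = &dev->next;	/* keep valid devs untouched */
		}
	}
	return destroyed;
}
```

The point of the split is that devs without INIT_NET_COPY never see
the mode change at all.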

What this really makes me think is that we don't want this alias device
model we have now.  We want full ib_device copies (which we will need
for the non-default copy case anyway...if an admin wants to add an RDMA
device to a new ns, and wants to control things like P_Keys allowed,
then we need to be able to fully configure that device).  Then we can
always shut it down forcefully as needed.

I really don't like the disconnect/reconnect model.  There's no reason
someone with a valid namespace association at the time we make this
change should see anything happen.  Just tear down what's invalid, and
leave the rest alone.

-- 
Doug Ledford <dledford@xxxxxxxxxx>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD