RE: [PATCH rdma-next v1 2/3] RDMA/core: Introduce a helper function to change net namespace of rdma device

Parav Pandit <parav@xxxxxxxxxxxx> · Wed, 10 Apr 2019 20:39:17 +0000



> -----Original Message-----
> From: Jason Gunthorpe <jgg@xxxxxxxx>
> Sent: Wednesday, April 10, 2019 2:36 PM
> To: Parav Pandit <parav@xxxxxxxxxxxx>
> Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Doug Ledford
> <dledford@xxxxxxxxxx>; Leon Romanovsky <leonro@xxxxxxxxxxxx>; RDMA
> mailing list <linux-rdma@xxxxxxxxxxxxxxx>
> Subject: Re: [PATCH rdma-next v1 2/3] RDMA/core: Introduce a helper
> function to change net namespace of rdma device
> 
> On Wed, Apr 10, 2019 at 07:11:41PM +0000, Parav Pandit wrote:
> > > > +
> > > >  static void rdma_dev_exit_net(struct net *net)  {
> > > >  	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
> > > @@
> > > > -983,6 +1013,7 @@ static void rdma_dev_exit_net(struct net *net)
> > > >  	}
> > > >  	up_read(&devices_rwsem);
> > > >
> > > > +	devices_to_default_net(net);
> > > >  	xa_erase(&rdma_nets, rnet->id);
> > > >  }
> > > >
> > > > @@ -1425,6 +1456,98 @@ void ib_unregister_device_queued(struct
> > > > ib_device *ib_dev)  }  EXPORT_SYMBOL(ib_unregister_device_queued);
> > > >
> > > > +static int __rdma_dev_change_netns(struct ib_device *device,
> > > > +struct net *net) {
> > > > +	int ret;
> > > > +
> > > > +	/* If the refcount is zero, it means disable_device() has progressed
> > > > +	 * first during unregistration phase. There is no point in changing net
> > > > +	 * namespace of device which is on path of removal.
> > > > +	 */
> > > > +	if (refcount_read(&device->refcount) == 0)
> > > > +		return -ENODEV;
> > > > +
> > > > +	disable_device(device);
> > > > +
> > > > +	/* At this point no one would be using the device,
> > > > +	 * so it is safe to change the device net namespace.
> > > > +	 */
> > > > +	write_pnet(&device->coredev.rdma_net, net);
> > > > +	/* Currently rdma devices are system wide unique. So device
> > > > +	 * name is guaranteed free in new namespace. Publish new
> > > > +	 * namespace at sysfs level.
> > > > +	 */
> > > > +	ret = device_rename(&device->dev, dev_name(&device->dev));
> > > > +	if (ret) {
> > > > +		dev_warn(&device->dev, "%s Couldn't rename device\n",
> > > __func__);
> > > > +		return ret;
> > >
> > > Hurm.. Technically we are supposed to hold the devices_rwsem if we
> > > are manipulating the name. Inferring that we don't have to because
> > > the refcount is 0 is a little over optimizing for my tastes
> >
> > So I was asking comment section in at beginning of device.c to be
> > updated in internal thread or github (don't recall) when these were
> > made xarray, to describe that it devices_rwsem protects rename
> > operation too.  That makes it crystal clear that that rename has to be
> > protected using devices_rwsem.  And that it can be added here to
> > protect it without relying on refcount.
> 
> It is not so much rename, as the devices_rwsem protects the device list and
> everything intrinsically unique about it. ie you can't read or write the name
> without holding that lock..
> 
> > > > +int rdma_dev_change_netns_with_put(struct ib_device *dev, struct
> > > > +net
> > > > +*net) {
> > >
> > > Missing static?
> > >
> > Because this function is being called in next patch.
> 
> I don't have this patch..
> 
> > > > +	 * Currently supported only for those providers which supports
> > > > +	 * disassociation and those which doesn't do port specific sysfs
> > > > +	 * init. Once a port_cleanup infrastructure is implemented, this
> > > > +	 * limitation will be removed.
> > > > +	 */
> > > > +	if (!dev->ops.disassociate_ucontext || dev->ops.init_port) {
> > > > +		ib_device_put(dev);
> > > > +		return -EOPNOTSUPP;
> > > > +	}
> > >
> > > I feel like this should also be prevented if ib_devices_shared_netns??
> > >
> > compat devices are added in net ns other than where it currently lives.
> > add_one_compat_dev() takes care to not any compat devices in in which it
> lives.
> > So it looks fine to me where ib_devices_shared_netns flag is honored.
> 
> More of a philisophical point.. I don't want to see people running NS aware
> code without disabling compat mode at all. I don't want to see them get into
> some weird state where devices beyond what they expect are showing up in
> their containers because the flag wasn't disabled.
> 
> > > > +	/* If net namespace was changed by the time this thread gets to it,
> > > > +	 * return an error.
> > > > +	 */
> > > > +	if (!net_eq(current->nsproxy->net_ns,
> > > > +		    read_pnet(&dev->coredev.rdma_net))) {
> > > > +		ret = -ENODEV;
> > > > +		goto done;
> > > > +	}
> > >
> > > And this needs to be inside __rdma_dev_change_netns as the other
> > > caller technically needs it too. .. and if we are doing that then
> > > the lock should be absorbed into __rdma_dev_change_netns as well so
> > > it is a self-consistent function.
> > >
> > It was here because I do not wanted to check the net while moving the
> device to init_net.
> > But since init_net cannot be deleted, it is fine to merge it inside
> __rdma_dev_change_netns.
> > It just confuses on why are we checking it while moving to init_net....
> > A comment would help to clarify that.
> 
> We are checking atomically that the device is still in the NS that is being
> deleted before moving it out. Otherwise the user could create a sub
> namespace, delete the parent namespace, move the device into the sub and
> trigger some race where the device escapes its namespace when it shouldn't.
> 
> > > What about the
> > > 	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
> > > 	kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
> > >
> > > Stuff that netdev does? Shouldn't we do that too? KOBJ_REMOVE is
> > > part of triggering userspace to let go of its processes too..
> > >
> > I am not sure if there are any rdma application that would be
> > listening to the broadcast netlink messages coming out of this.  And
> > if someone report it, it can be added later.
> 
> We use this already with systemd. You should test it. Make sure srp_daemon
> stops and starts as appropriate.
> 
> > > +/*
> > > + * The caller must pass in a device that has the kref held and the
> > > +refcount
> > > + * released. If the device is in cur_net and still registered then
> > > +it is moved
> > > + * into net.
> > > + */
> > > +static int rdma_dev_change_netns(struct ib_device *device, struct
> > > +net
> > > *cur_net,
> > > +				 struct net *net)
> > >  {
> > > +	int ret2 = -EINVAL;
> > >  	int ret;
> > >
> > > -	/* If the refcount is zero, it means disable_device() has progressed
> > > -	 * first during unregistration phase. There is no point in changing net
> > > -	 * namespace of device which is on path of removal.
> > > +	mutex_lock(&device->unregistration_lock);
> > > +
> > > +	/*
> > > +	 * If a device not under ib_device_get() or the unregistration_lock
> > > +	 * the namespace can be changed, or it can be unregistered. Check
> > > +	 * again under the lock.
> > >  	 */
> > > -	if (refcount_read(&device->refcount) == 0)
> > > -		return -ENODEV;
> > > +	if (!net_eq(current->nsproxy->net_ns,
> > > +		    read_pnet(&device->coredev.rdma_net)) ||
> > > +	    refcount_read(&device->refcount) == 0) {
> > > +		ret = -ENODEV;
> > > +		goto out;
> > > +	}
> > We should read the refcount before reading net ns of the device.
> 
> Yes, probably.
> 
> > >
> > > +	kobject_uevent(&device->dev.kobj, KOBJ_REMOVE);
> > >  	disable_device(device);
> > >
> > > -	/* At this point no one would be using the device,
> > > -	 * so it is safe to change the device net namespace.
> > > +	/*
> > > +	 * At this point no one can be using the device, so it is safe to
> > > +	 * change the namespace.
> > >  	 */
> > >  	write_pnet(&device->coredev.rdma_net, net);
> > > -	/* Currently rdma devices are system wide unique. So device
> > > -	 * name is guaranteed free in new namespace. Publish new
> > > -	 * namespace at sysfs level.
> > > +
> > > +	/*
> > > +	 * Currently rdma devices are system wide unique. So the device
> > > name
> > > +	 * is guaranteed free in the new namespace. Publish the new
> > > namespace
> > > +	 * at the sysfs level.
> > >  	 */
> > Above comment block applies after down_read(), so it should be place
> right before device_rename().
> 
> Meh, it reads better as a single stanza, the comment obviously applies to the
> device_rname so I wouldn't re-organize the code
> 
> > > -	/* We must drop the device reference; otherwise disable_device()
> > > -	 * who is waiting for refcount to drop to zero while holding
> > > -	 * unregistration_lock won't happen. This can lead to deadlock.
> > > -	 * Also if we give up the ib_device reference here,
> > > -	 * ib_unregister_device() followed by ib_dealloc_device() can
> > > proceeed
> > > -	 * to free this device. Therefore, this thread must hold a reference
> > > -	 * to the device structure so that it cannot get freed, while this
> > > -	 * thread works on it.
> > > -	 */
> > Code is not by heart to many now and in future as much as you. :-) And
> > things done here is not intuitive either.
> > Hence I make humble request to include above comment block in
> ib_device_set_netns_put().
> 
> We have many cases of this pattern, this is not the only one. It makes no
> sense to have a *giant* comment on top of only one case.
> 
> If you find it confusing then make a function and add that text to the
> function documentation.
> 
> > > -	ret = rdma_dev_change_netns_with_put(dev, net);
> > > +	/*
> > > +	 * Currently supported only for those providers which support
> > > +	 * disassociation and don't do port specific sysfs init. Once a
> > > +	 * port_cleanup infrastructure is implemented, this limitation will be
> > > +	 * removed.
> > > +	 */
> > > +	if (!dev->ops.disassociate_ucontext || dev->ops.init_port ||
> > > +	    ib_devices_shared_netns) {
> > Reading ib_devices_shared_netns without rdma_net_rwsem is not of any
> help.
> 
> Hurm, yes something better should be done here I suppose
> 
> > It can be omitted as I explained above.  If you want to add it to
> > notify user that he shouldn't be doing it currently because it is in
> > shared mode, then yes, its fine...
> 
> Yes, that is what I think
> 
> > > +		ret = -EOPNOTSUPP;
> > > +		goto ns_err;
> > > +	}
> > > +
> > > +	get_device(&dev->dev);
> > > +	ib_device_put(dev);
> > > +	ret = rdma_dev_change_netns(dev, current->nsproxy->net_ns, net);
> > > +	put_device(&dev->dev);
> > > +
> > >  	put_net(net);
> > >  	return ret;
> > >
> >
> > Other than those minor comments, delta looks fine.
> > I like to test due to this delta once you merge the changes and before you
> send PR to Linus.
> 
> I think you should resend the patch after testing and fixing the things you
> pointed
> 
Ok. yes, I also prefer that.
Will send through Leon's queue addressing these comments.