Re: [PATCH rdma-next v1 2/3] RDMA/core: Introduce a helper function to change net namespace of rdma device

Jason Gunthorpe <jgg@xxxxxxxx> · Wed, 10 Apr 2019 16:35:50 -0300

On Wed, Apr 10, 2019 at 07:11:41PM +0000, Parav Pandit wrote:
> > > +
> > >  static void rdma_dev_exit_net(struct net *net)  {
> > >  	struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id);
> > @@
> > > -983,6 +1013,7 @@ static void rdma_dev_exit_net(struct net *net)
> > >  	}
> > >  	up_read(&devices_rwsem);
> > >
> > > +	devices_to_default_net(net);
> > >  	xa_erase(&rdma_nets, rnet->id);
> > >  }
> > >
> > > @@ -1425,6 +1456,98 @@ void ib_unregister_device_queued(struct
> > > ib_device *ib_dev)  }  EXPORT_SYMBOL(ib_unregister_device_queued);
> > >
> > > +static int __rdma_dev_change_netns(struct ib_device *device, struct
> > > +net *net) {
> > > +	int ret;
> > > +
> > > +	/* If the refcount is zero, it means disable_device() has progressed
> > > +	 * first during unregistration phase. There is no point in changing net
> > > +	 * namespace of device which is on path of removal.
> > > +	 */
> > > +	if (refcount_read(&device->refcount) == 0)
> > > +		return -ENODEV;
> > > +
> > > +	disable_device(device);
> > > +
> > > +	/* At this point no one would be using the device,
> > > +	 * so it is safe to change the device net namespace.
> > > +	 */
> > > +	write_pnet(&device->coredev.rdma_net, net);
> > > +	/* Currently rdma devices are system wide unique. So device
> > > +	 * name is guaranteed free in new namespace. Publish new
> > > +	 * namespace at sysfs level.
> > > +	 */
> > > +	ret = device_rename(&device->dev, dev_name(&device->dev));
> > > +	if (ret) {
> > > +		dev_warn(&device->dev, "%s Couldn't rename device\n",
> > __func__);
> > > +		return ret;
> > 
> > Hurm.. Technically we are supposed to hold the devices_rwsem if we are
> > manipulating the name. Inferring that we don't have to because the refcount
> > is 0 is a little over optimizing for my tastes
> 
> So I was asking comment section in at beginning of device.c to be
> updated in internal thread or github (don't recall) when these were
> made xarray, to describe that it devices_rwsem protects rename
> operation too.  That makes it crystal clear that that rename has to
> be protected using devices_rwsem.  And that it can be added here to
> protect it without relying on refcount.

It is not so much rename, as the devices_rwsem protects the device
list and everything intrinsically unique about it. ie you can't read
or write the name without holding that lock..

> > > +int rdma_dev_change_netns_with_put(struct ib_device *dev, struct net
> > > +*net) {
> > 
> > Missing static?
> > 
> Because this function is being called in next patch.

I don't have this patch..

> > > +	 * Currently supported only for those providers which supports
> > > +	 * disassociation and those which doesn't do port specific sysfs
> > > +	 * init. Once a port_cleanup infrastructure is implemented, this
> > > +	 * limitation will be removed.
> > > +	 */
> > > +	if (!dev->ops.disassociate_ucontext || dev->ops.init_port) {
> > > +		ib_device_put(dev);
> > > +		return -EOPNOTSUPP;
> > > +	}
> > 
> > I feel like this should also be prevented if ib_devices_shared_netns??
> >
> compat devices are added in net ns other than where it currently lives.
> add_one_compat_dev() takes care to not any compat devices in in which it lives.
> So it looks fine to me where ib_devices_shared_netns flag is honored.

More of a philisophical point.. I don't want to see people running NS
aware code without disabling compat mode at all. I don't want to see
them get into some weird state where devices beyond what they expect
are showing up in their containers because the flag wasn't disabled.

> > > +	/* If net namespace was changed by the time this thread gets to it,
> > > +	 * return an error.
> > > +	 */
> > > +	if (!net_eq(current->nsproxy->net_ns,
> > > +		    read_pnet(&dev->coredev.rdma_net))) {
> > > +		ret = -ENODEV;
> > > +		goto done;
> > > +	}
> > 
> > And this needs to be inside __rdma_dev_change_netns as the other caller
> > technically needs it too. .. and if we are doing that then the lock should be
> > absorbed into __rdma_dev_change_netns as well so it is a self-consistent
> > function.
> > 
> It was here because I do not wanted to check the net while moving the device to init_net.
> But since init_net cannot be deleted, it is fine to merge it inside __rdma_dev_change_netns.
> It just confuses on why are we checking it while moving to init_net....
> A comment would help to clarify that.

We are checking atomically that the device is still in the NS that is being
deleted before moving it out. Otherwise the user could create a sub
namespace, delete the parent namespace, move the device into the sub
and trigger some race where the device escapes its namespace when it
shouldn't.

> > What about the
> > 	kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
> > 	kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
> > 
> > Stuff that netdev does? Shouldn't we do that too? KOBJ_REMOVE is part of
> > triggering userspace to let go of its processes too..
> > 
> I am not sure if there are any rdma application that would be
> listening to the broadcast netlink messages coming out of this.  And
> if someone report it, it can be added later.

We use this already with systemd. You should test it. Make sure
srp_daemon stops and starts as appropriate.

> > +/*
> > + * The caller must pass in a device that has the kref held and the
> > +refcount
> > + * released. If the device is in cur_net and still registered then it
> > +is moved
> > + * into net.
> > + */
> > +static int rdma_dev_change_netns(struct ib_device *device, struct net
> > *cur_net,
> > +				 struct net *net)
> >  {
> > +	int ret2 = -EINVAL;
> >  	int ret;
> > 
> > -	/* If the refcount is zero, it means disable_device() has progressed
> > -	 * first during unregistration phase. There is no point in changing net
> > -	 * namespace of device which is on path of removal.
> > +	mutex_lock(&device->unregistration_lock);
> > +
> > +	/*
> > +	 * If a device not under ib_device_get() or the unregistration_lock
> > +	 * the namespace can be changed, or it can be unregistered. Check
> > +	 * again under the lock.
> >  	 */
> > -	if (refcount_read(&device->refcount) == 0)
> > -		return -ENODEV;
> > +	if (!net_eq(current->nsproxy->net_ns,
> > +		    read_pnet(&device->coredev.rdma_net)) ||
> > +	    refcount_read(&device->refcount) == 0) {
> > +		ret = -ENODEV;
> > +		goto out;
> > +	}
> We should read the refcount before reading net ns of the device.

Yes, probably.

> > 
> > +	kobject_uevent(&device->dev.kobj, KOBJ_REMOVE);
> >  	disable_device(device);
> > 
> > -	/* At this point no one would be using the device,
> > -	 * so it is safe to change the device net namespace.
> > +	/*
> > +	 * At this point no one can be using the device, so it is safe to
> > +	 * change the namespace.
> >  	 */
> >  	write_pnet(&device->coredev.rdma_net, net);
> > -	/* Currently rdma devices are system wide unique. So device
> > -	 * name is guaranteed free in new namespace. Publish new
> > -	 * namespace at sysfs level.
> > +
> > +	/*
> > +	 * Currently rdma devices are system wide unique. So the device
> > name
> > +	 * is guaranteed free in the new namespace. Publish the new
> > namespace
> > +	 * at the sysfs level.
> >  	 */
> Above comment block applies after down_read(), so it should be place right before device_rename().

Meh, it reads better as a single stanza, the comment obviously applies
to the device_rname so I wouldn't re-organize the code

> > -	/* We must drop the device reference; otherwise disable_device()
> > -	 * who is waiting for refcount to drop to zero while holding
> > -	 * unregistration_lock won't happen. This can lead to deadlock.
> > -	 * Also if we give up the ib_device reference here,
> > -	 * ib_unregister_device() followed by ib_dealloc_device() can
> > proceeed
> > -	 * to free this device. Therefore, this thread must hold a reference
> > -	 * to the device structure so that it cannot get freed, while this
> > -	 * thread works on it.
> > -	 */
> Code is not by heart to many now and in future as much as you. :-)
> And things done here is not intuitive either.
> Hence I make humble request to include above comment block in ib_device_set_netns_put().

We have many cases of this pattern, this is not the only one. It makes
no sense to have a *giant* comment on top of only one case. 

If you find it confusing then make a function and add that text to the
function documentation.

> > -	ret = rdma_dev_change_netns_with_put(dev, net);
> > +	/*
> > +	 * Currently supported only for those providers which support
> > +	 * disassociation and don't do port specific sysfs init. Once a
> > +	 * port_cleanup infrastructure is implemented, this limitation will be
> > +	 * removed.
> > +	 */
> > +	if (!dev->ops.disassociate_ucontext || dev->ops.init_port ||
> > +	    ib_devices_shared_netns) {
> Reading ib_devices_shared_netns without rdma_net_rwsem is not of any help.

Hurm, yes something better should be done here I suppose

> It can be omitted as I explained above.  If you want to add it to
> notify user that he shouldn't be doing it currently because it is in
> shared mode, then yes, its fine...

Yes, that is what I think

> > +		ret = -EOPNOTSUPP;
> > +		goto ns_err;
> > +	}
> > +
> > +	get_device(&dev->dev);
> > +	ib_device_put(dev);
> > +	ret = rdma_dev_change_netns(dev, current->nsproxy->net_ns, net);
> > +	put_device(&dev->dev);
> > +
> >  	put_net(net);
> >  	return ret;
> >
> 
> Other than those minor comments, delta looks fine.
> I like to test due to this delta once you merge the changes and before you send PR to Linus.

I think you should resend the patch after testing and fixing the
things you pointed

Jason