> -----Original Message----- > From: Jason Gunthorpe <jgg@xxxxxxxx> > Sent: Wednesday, April 10, 2019 2:36 PM > To: Parav Pandit <parav@xxxxxxxxxxxx> > Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Doug Ledford > <dledford@xxxxxxxxxx>; Leon Romanovsky <leonro@xxxxxxxxxxxx>; RDMA > mailing list <linux-rdma@xxxxxxxxxxxxxxx> > Subject: Re: [PATCH rdma-next v1 2/3] RDMA/core: Introduce a helper > function to change net namespace of rdma device > > On Wed, Apr 10, 2019 at 07:11:41PM +0000, Parav Pandit wrote: > > > > + > > > > static void rdma_dev_exit_net(struct net *net) { > > > > struct rdma_dev_net *rnet = net_generic(net, rdma_dev_net_id); > > > @@ > > > > -983,6 +1013,7 @@ static void rdma_dev_exit_net(struct net *net) > > > > } > > > > up_read(&devices_rwsem); > > > > > > > > + devices_to_default_net(net); > > > > xa_erase(&rdma_nets, rnet->id); > > > > } > > > > > > > > @@ -1425,6 +1456,98 @@ void ib_unregister_device_queued(struct > > > > ib_device *ib_dev) } EXPORT_SYMBOL(ib_unregister_device_queued); > > > > > > > > +static int __rdma_dev_change_netns(struct ib_device *device, > > > > +struct net *net) { > > > > + int ret; > > > > + > > > > + /* If the refcount is zero, it means disable_device() has progressed > > > > + * first during unregistration phase. There is no point in changing net > > > > + * namespace of device which is on path of removal. > > > > + */ > > > > + if (refcount_read(&device->refcount) == 0) > > > > + return -ENODEV; > > > > + > > > > + disable_device(device); > > > > + > > > > + /* At this point no one would be using the device, > > > > + * so it is safe to change the device net namespace. > > > > + */ > > > > + write_pnet(&device->coredev.rdma_net, net); > > > > + /* Currently rdma devices are system wide unique. So device > > > > + * name is guaranteed free in new namespace. Publish new > > > > + * namespace at sysfs level. > > > > + */ > > > > + ret = device_rename(&device->dev, dev_name(&device->dev)); > > > > + if (ret) { > > > > + dev_warn(&device->dev, "%s Couldn't rename device\n", > > > __func__); > > > > + return ret; > > > > > > Hurm.. Technically we are supposed to hold the devices_rwsem if we > > > are manipulating the name. Inferring that we don't have to because > > > the refcount is 0 is a little over optimizing for my tastes > > > > So I was asking comment section in at beginning of device.c to be > > updated in internal thread or github (don't recall) when these were > > made xarray, to describe that it devices_rwsem protects rename > > operation too. That makes it crystal clear that that rename has to be > > protected using devices_rwsem. And that it can be added here to > > protect it without relying on refcount. > > It is not so much rename, as the devices_rwsem protects the device list and > everything intrinsically unique about it. ie you can't read or write the name > without holding that lock.. > > > > > +int rdma_dev_change_netns_with_put(struct ib_device *dev, struct > > > > +net > > > > +*net) { > > > > > > Missing static? > > > > > Because this function is being called in next patch. > > I don't have this patch.. > > > > > + * Currently supported only for those providers which supports > > > > + * disassociation and those which doesn't do port specific sysfs > > > > + * init. Once a port_cleanup infrastructure is implemented, this > > > > + * limitation will be removed. > > > > + */ > > > > + if (!dev->ops.disassociate_ucontext || dev->ops.init_port) { > > > > + ib_device_put(dev); > > > > + return -EOPNOTSUPP; > > > > + } > > > > > > I feel like this should also be prevented if ib_devices_shared_netns?? > > > > > compat devices are added in net ns other than where it currently lives. > > add_one_compat_dev() takes care to not any compat devices in in which it > lives. > > So it looks fine to me where ib_devices_shared_netns flag is honored. > > More of a philisophical point.. I don't want to see people running NS aware > code without disabling compat mode at all. I don't want to see them get into > some weird state where devices beyond what they expect are showing up in > their containers because the flag wasn't disabled. > > > > > + /* If net namespace was changed by the time this thread gets to it, > > > > + * return an error. > > > > + */ > > > > + if (!net_eq(current->nsproxy->net_ns, > > > > + read_pnet(&dev->coredev.rdma_net))) { > > > > + ret = -ENODEV; > > > > + goto done; > > > > + } > > > > > > And this needs to be inside __rdma_dev_change_netns as the other > > > caller technically needs it too. .. and if we are doing that then > > > the lock should be absorbed into __rdma_dev_change_netns as well so > > > it is a self-consistent function. > > > > > It was here because I do not wanted to check the net while moving the > device to init_net. > > But since init_net cannot be deleted, it is fine to merge it inside > __rdma_dev_change_netns. > > It just confuses on why are we checking it while moving to init_net.... > > A comment would help to clarify that. > > We are checking atomically that the device is still in the NS that is being > deleted before moving it out. Otherwise the user could create a sub > namespace, delete the parent namespace, move the device into the sub and > trigger some race where the device escapes its namespace when it shouldn't. > > > > What about the > > > kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE); > > > kobject_uevent(&dev->dev.kobj, KOBJ_ADD); > > > > > > Stuff that netdev does? Shouldn't we do that too? KOBJ_REMOVE is > > > part of triggering userspace to let go of its processes too.. > > > > > I am not sure if there are any rdma application that would be > > listening to the broadcast netlink messages coming out of this. And > > if someone report it, it can be added later. > > We use this already with systemd. You should test it. Make sure srp_daemon > stops and starts as appropriate. > > > > +/* > > > + * The caller must pass in a device that has the kref held and the > > > +refcount > > > + * released. If the device is in cur_net and still registered then > > > +it is moved > > > + * into net. > > > + */ > > > +static int rdma_dev_change_netns(struct ib_device *device, struct > > > +net > > > *cur_net, > > > + struct net *net) > > > { > > > + int ret2 = -EINVAL; > > > int ret; > > > > > > - /* If the refcount is zero, it means disable_device() has progressed > > > - * first during unregistration phase. There is no point in changing net > > > - * namespace of device which is on path of removal. > > > + mutex_lock(&device->unregistration_lock); > > > + > > > + /* > > > + * If a device not under ib_device_get() or the unregistration_lock > > > + * the namespace can be changed, or it can be unregistered. Check > > > + * again under the lock. > > > */ > > > - if (refcount_read(&device->refcount) == 0) > > > - return -ENODEV; > > > + if (!net_eq(current->nsproxy->net_ns, > > > + read_pnet(&device->coredev.rdma_net)) || > > > + refcount_read(&device->refcount) == 0) { > > > + ret = -ENODEV; > > > + goto out; > > > + } > > We should read the refcount before reading net ns of the device. > > Yes, probably. > > > > > > > + kobject_uevent(&device->dev.kobj, KOBJ_REMOVE); > > > disable_device(device); > > > > > > - /* At this point no one would be using the device, > > > - * so it is safe to change the device net namespace. > > > + /* > > > + * At this point no one can be using the device, so it is safe to > > > + * change the namespace. > > > */ > > > write_pnet(&device->coredev.rdma_net, net); > > > - /* Currently rdma devices are system wide unique. So device > > > - * name is guaranteed free in new namespace. Publish new > > > - * namespace at sysfs level. > > > + > > > + /* > > > + * Currently rdma devices are system wide unique. So the device > > > name > > > + * is guaranteed free in the new namespace. Publish the new > > > namespace > > > + * at the sysfs level. > > > */ > > Above comment block applies after down_read(), so it should be place > right before device_rename(). > > Meh, it reads better as a single stanza, the comment obviously applies to the > device_rname so I wouldn't re-organize the code > > > > - /* We must drop the device reference; otherwise disable_device() > > > - * who is waiting for refcount to drop to zero while holding > > > - * unregistration_lock won't happen. This can lead to deadlock. > > > - * Also if we give up the ib_device reference here, > > > - * ib_unregister_device() followed by ib_dealloc_device() can > > > proceeed > > > - * to free this device. Therefore, this thread must hold a reference > > > - * to the device structure so that it cannot get freed, while this > > > - * thread works on it. > > > - */ > > Code is not by heart to many now and in future as much as you. :-) And > > things done here is not intuitive either. > > Hence I make humble request to include above comment block in > ib_device_set_netns_put(). > > We have many cases of this pattern, this is not the only one. It makes no > sense to have a *giant* comment on top of only one case. > > If you find it confusing then make a function and add that text to the > function documentation. > > > > - ret = rdma_dev_change_netns_with_put(dev, net); > > > + /* > > > + * Currently supported only for those providers which support > > > + * disassociation and don't do port specific sysfs init. Once a > > > + * port_cleanup infrastructure is implemented, this limitation will be > > > + * removed. > > > + */ > > > + if (!dev->ops.disassociate_ucontext || dev->ops.init_port || > > > + ib_devices_shared_netns) { > > Reading ib_devices_shared_netns without rdma_net_rwsem is not of any > help. > > Hurm, yes something better should be done here I suppose > > > It can be omitted as I explained above. If you want to add it to > > notify user that he shouldn't be doing it currently because it is in > > shared mode, then yes, its fine... > > Yes, that is what I think > > > > + ret = -EOPNOTSUPP; > > > + goto ns_err; > > > + } > > > + > > > + get_device(&dev->dev); > > > + ib_device_put(dev); > > > + ret = rdma_dev_change_netns(dev, current->nsproxy->net_ns, net); > > > + put_device(&dev->dev); > > > + > > > put_net(net); > > > return ret; > > > > > > > Other than those minor comments, delta looks fine. > > I like to test due to this delta once you merge the changes and before you > send PR to Linus. > > I think you should resend the patch after testing and fixing the things you > pointed > Ok. yes, I also prefer that. Will send through Leon's queue addressing these comments.