RDMA subsystem namespace related questions (was Re: Finding the namespace of a struct ib_device)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/7/20 9:48 PM, Ka-Cheong Poon wrote:

This may require a number of changes and the way a client interacts with
the current RDMA framework.  For example, currently a client registers
once using one struct ib_client and gets device notifications for all
namespaces and devices.  Suppose there is rdma_[un]register_net_client(),
it may need to require a client to use a different struct ib_client to
register for each net namespace.  And struct ib_client probably needs to
have a field to store the net namespace.  Probably all those client
interaction functions will need to be modified.  Since the clients xarray
is global, more clients may mean performance implication, such as it takes
longer to go through the whole clients xarray.

There are probably many other subtle changes required.  It may turn out to
be not so straight forward.  Is this community willing the take such changes?
I can take a stab at it if the community really thinks that this is preferred.


Attached is a diff of a prototype for the above.  This exercise is
to see what needs to be done to have a more network namespace aware
interface for RDMA client registration.

Currently, there are ib_[un]register_client().  Under the RDMA namespace
exclusive mode, all RDMA devices are assigned to the init_net namespace
initially.  A kernel module uses this interface to register with the RDMA
subsystem.  When a device is assigned to a namespace, the client's
registered remove upcall is called with the device as the parameter (this
is removing from the init_net namespace).  Then the client's add upcall
is called with the device as the parameter (this is assigning to the new
namespace).  When that namespace is removed (*), a similar sequence of
events happen, a remove upcall (removing from the namespace) is followed
by add upcall (assigning back to the init_net namespace).  All the RDMA
clients are stored in a global struct xarray called clients (in device.c)
and each client is assigned a client ID.

This exercise adds the rdma_[un]register_net_client() for those clients
which want to have more separation between different namespaces.  This
interface takes a struct net parameter.  A kernel module uses this to
indicate that it is only interested in the RDMA events related to the
given network namespace.  Suppose a client uses init_net as the parameter.
In the above example when a device is assigned to a namespace, only the
client's remove upcall is called (removing from the init_net namespace).
The add upcall is not followed.  Then when the namespace is removed, the
client's add upcall is called (re-assigning back to the init_net namespace).
Suppose a client uses a specific namespace as the parameter.  When a device
is assigned to that specific namespace, the client's add upcall is called.
When the client unregisters with RDMA (or when the namespace is going away),
the client's remove upcall is called.  The RDMA clients are stored in each
namespace's struct rdma_dev_net and each client is assigned a client ID
in that namespace (this means that it is unique only in that namespace but
not unique globally among all namespaces).

This seemingly simple exercise turned out to be not so simple because of
the need to keep the existing interface with the existing behavior.  So only
when a client uses the new interface, the behavior is changed to what is
described above.  There should be no change of behavior to any existing
RDMA client.  There are several obstacles to overcome for this change.  One
difficulty is the global client ID since a lot of code rely on this ID as an
index the both the global clients xarray and individual device's client_data
xarray.  Detailed changes are in the attached diff if folks are interested.

Note that the new interface has one obvious issue, it does not make much sense
in RDMA shared network namespace mode.  In the shared mode, all devices are
associated with init_net.  So if a client uses the new interface to register
a specific namespace other than init_net, it will never get any upcall.  This
and the difficulties in adding a seemingly simple interface makes me wonder
about the following questions.

Is the RDMA shared namespace mode the preferred mode to use as it is the
default mode?  Is it expected that a client knows the running mode before
interacting with the RDMA subsystem?  Is a client not supposed to differentiate
different namespaces?  Besides the current add client upcall, another example
related to this is about event handling.  Suppose a client calls rdma_create_id()
to create listeners in different namespaces but with the same event handler.
A new connection comes in and the event handler is called for an
RDMA_CM_EVENT_CONNECT_REQUEST event.  There is no obvious namespace info regarding
the event.  It seems that the only way to find out the namespace info is to
use the context of struct rdma_cm_id.  The client must somehow add the namespace
info to the context since the subsystem does not provide any help.  Is this the
assumed solution?  BTW, this exercise still does not remove the need to have
rdma_dev_to_netns() as the add upcall does not provide any namespace info.  Given
all these questions, the rdma_[un]register_net_client() do not seem to fit in
the current way in interacting with the RDMA subsystem unfortunately.

Thanks.


(*) Note that in __rdma_create_id(), it does a get_net(net) to put a
    reference on a namespace.  Suppose a kernel module calls rdma_create_id()
    in its namespace .init function to create an RDMA listener and calls
    rdma_destroy_id() in its namespace .exit function to destroy it.  Since
    __rdma_create_id() adds a reference to a namespace, when a sys admin
    deletes a namespace (say `ip netns del ...`), the namespace won't be
    deleted because of this reference.  And the module will not release this
    reference until its .exit function is called only when the namespace is
    deleted.  To resolve this issue, in the diff (in __rdma_create_id()), I
    did something similar to the kern check in sk_alloc().



--
K. Poon
ka-cheong.poon@xxxxxxxxxx


diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 7f0e91e92968..15eb91eee200 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -873,7 +873,10 @@ struct rdma_cm_id *__rdma_create_id(struct net *net,
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-	id_priv->id.route.addr.dev_addr.net = get_net(net);
+	if (caller)
+		id_priv->id.route.addr.dev_addr.net = net;
+	else
+		id_priv->id.route.addr.dev_addr.net = get_net(net);
 	id_priv->seq_num &= 0x00ffffff;
 
 	return &id_priv->id;
@@ -1819,8 +1822,12 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
 static void _destroy_id(struct rdma_id_private *id_priv,
 			enum rdma_cm_state state)
 {
+	bool rel_net = true;
+
 	cma_cancel_operation(id_priv, state);
 
+	if (id_priv->res.kern_name)
+		rel_net = false;
 	rdma_restrack_del(&id_priv->res);
 	if (id_priv->cma_dev) {
 		if (rdma_cap_ib_cm(id_priv->id.device, 1)) {
@@ -1846,7 +1853,8 @@ static void _destroy_id(struct rdma_id_private *id_priv,
 	if (id_priv->id.route.addr.dev_addr.sgid_attr)
 		rdma_put_gid_attr(id_priv->id.route.addr.dev_addr.sgid_attr);
 
-	put_net(id_priv->id.route.addr.dev_addr.net);
+	if (rel_net)
+		put_net(id_priv->id.route.addr.dev_addr.net);
 	kfree(id_priv);
 }
 
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index a1e6a67b2c4a..3c6c3cd516f3 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -66,6 +66,11 @@ struct rdma_dev_net {
 	struct sock *nl_sock;
 	possible_net_t net;
 	u32 id;
+
+	u32			rdn_highest_client_id;
+	struct xarray		rdn_clients;
+	struct rw_semaphore	rdn_clients_rwsem;
+
 };
 
 extern const struct attribute_group ib_dev_attr_group;
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index c36b4d2b61e0..f113c9b2e547 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -93,10 +93,7 @@ static DEFINE_XARRAY_FLAGS(devices, XA_FLAGS_ALLOC);
 static DECLARE_RWSEM(devices_rwsem);
 #define DEVICE_REGISTERED XA_MARK_1
 
-static u32 highest_client_id;
 #define CLIENT_REGISTERED XA_MARK_1
-static DEFINE_XARRAY_FLAGS(clients, XA_FLAGS_ALLOC);
-static DECLARE_RWSEM(clients_rwsem);
 
 static void ib_client_put(struct ib_client *client)
 {
@@ -399,6 +396,7 @@ static int rename_compat_devs(struct ib_device *device)
 
 int ib_device_rename(struct ib_device *ibdev, const char *name)
 {
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 	void *client_data;
 	int ret;
@@ -424,10 +422,12 @@ int ib_device_rename(struct ib_device *ibdev, const char *name)
 	ret = rename_compat_devs(ibdev);
 
 	downgrade_write(&devices_rwsem);
+	rdn = rdma_net_to_dev_net(read_pnet(&ibdev->coredev.rdma_net));
+
 	down_read(&ibdev->client_data_rwsem);
 	xan_for_each_marked(&ibdev->client_data, index, client_data,
 			    CLIENT_DATA_REGISTERED) {
-		struct ib_client *client = xa_load(&clients, index);
+		struct ib_client *client = xa_load(&rdn->rdn_clients, index);
 
 		if (!client || !client->rename)
 			continue;
@@ -504,6 +504,7 @@ static void ib_device_release(struct device *device)
 
 	xa_destroy(&dev->compat_devs);
 	xa_destroy(&dev->client_data);
+	xa_destroy(&dev->net_client_data);
 	kfree_rcu(dev, rcu_head);
 }
 
@@ -594,6 +595,7 @@ struct ib_device *_ib_alloc_device(size_t size)
 	 * destroyed if the user stores NULL in the client data.
 	 */
 	xa_init_flags(&device->client_data, XA_FLAGS_ALLOC);
+	xa_init_flags(&device->net_client_data, XA_FLAGS_ALLOC);
 	init_rwsem(&device->client_data_rwsem);
 	xa_init_flags(&device->compat_devs, XA_FLAGS_ALLOC);
 	mutex_init(&device->compat_devs_mutex);
@@ -631,6 +633,7 @@ void ib_dealloc_device(struct ib_device *device)
 
 	WARN_ON(!xa_empty(&device->compat_devs));
 	WARN_ON(!xa_empty(&device->client_data));
+	WARN_ON(!xa_empty(&device->net_client_data));
 	WARN_ON(refcount_read(&device->refcount));
 	rdma_restrack_clean(device);
 	/* Balances with device_initialize */
@@ -647,8 +650,9 @@ EXPORT_SYMBOL(ib_dealloc_device);
  * or remove is fully completed.
  */
 static int add_client_context(struct ib_device *device,
-			      struct ib_client *client)
+			      struct ib_client *client, bool net_client)
 {
+	struct xarray *cl_data;
 	int ret = 0;
 
 	if (!device->kverbs_provider && !client->no_kverbs_req)
@@ -663,16 +667,20 @@ static int add_client_context(struct ib_device *device,
 		goto out_unlock;
 	refcount_inc(&device->refcount);
 
+	if (net_client)
+		cl_data = &device->net_client_data;
+	else
+		cl_data = &device->client_data;
+
 	/*
 	 * Another caller to add_client_context got here first and has already
 	 * completely initialized context.
 	 */
-	if (xa_get_mark(&device->client_data, client->client_id,
+	if (xa_get_mark(cl_data, client->client_id,
 		    CLIENT_DATA_REGISTERED))
 		goto out;
 
-	ret = xa_err(xa_store(&device->client_data, client->client_id, NULL,
-			      GFP_KERNEL));
+	ret = xa_err(xa_store(cl_data, client->client_id, NULL, GFP_KERNEL));
 	if (ret)
 		goto out;
 	downgrade_write(&device->client_data_rwsem);
@@ -692,8 +700,7 @@ static int add_client_context(struct ib_device *device,
 	}
 
 	/* Readers shall not see a client until add has been completed */
-	xa_set_mark(&device->client_data, client->client_id,
-		    CLIENT_DATA_REGISTERED);
+	xa_set_mark(cl_data, client->client_id, CLIENT_DATA_REGISTERED);
 	up_read(&device->client_data_rwsem);
 	return 0;
 
@@ -706,20 +713,26 @@ static int add_client_context(struct ib_device *device,
 }
 
 static void remove_client_context(struct ib_device *device,
-				  unsigned int client_id)
+				  unsigned int client_id,
+				  struct rdma_dev_net *rdn, bool net_client)
 {
 	struct ib_client *client;
+	struct xarray *cl_data;
 	void *client_data;
 
+	if (net_client)
+		cl_data = &device->net_client_data;
+	else
+		cl_data = &device->client_data;
+
 	down_write(&device->client_data_rwsem);
-	if (!xa_get_mark(&device->client_data, client_id,
-			 CLIENT_DATA_REGISTERED)) {
+	if (!xa_get_mark(cl_data, client_id, CLIENT_DATA_REGISTERED)) {
 		up_write(&device->client_data_rwsem);
 		return;
 	}
-	client_data = xa_load(&device->client_data, client_id);
-	xa_clear_mark(&device->client_data, client_id, CLIENT_DATA_REGISTERED);
-	client = xa_load(&clients, client_id);
+	client_data = xa_load(cl_data, client_id);
+	xa_clear_mark(cl_data, client_id, CLIENT_DATA_REGISTERED);
+	client = xa_load(&rdn->rdn_clients, client_id);
 	up_write(&device->client_data_rwsem);
 
 	/*
@@ -734,7 +747,10 @@ static void remove_client_context(struct ib_device *device,
 	if (client->remove)
 		client->remove(device, client_data);
 
-	xa_erase(&device->client_data, client_id);
+	if (client->net_client)
+		xa_erase(&device->net_client_data, client_id);
+	else
+		xa_erase(&device->client_data, client_id);
 	ib_device_put(device);
 	ib_client_put(client);
 }
@@ -924,6 +940,7 @@ static int add_one_compat_dev(struct ib_device *device,
 		goto insert_err;
 
 	mutex_unlock(&device->compat_devs_mutex);
+
 	return 0;
 
 insert_err:
@@ -1099,6 +1116,9 @@ static void rdma_dev_exit_net(struct net *net)
 
 	rdma_nl_net_exit(rnet);
 	xa_erase(&rdma_nets, rnet->id);
+
+	WARN_ON(!xa_empty(&rnet->rdn_clients));
+	xa_destroy(&rnet->rdn_clients);
 }
 
 static __net_init int rdma_dev_init_net(struct net *net)
@@ -1114,6 +1134,9 @@ static __net_init int rdma_dev_init_net(struct net *net)
 	if (ret)
 		return ret;
 
+	xa_init_flags(&rnet->rdn_clients, XA_FLAGS_ALLOC);
+	init_rwsem(&rnet->rdn_clients_rwsem);
+
 	/* No need to create any compat devices in default init_net. */
 	if (net_eq(net, &init_net))
 		return 0;
@@ -1263,9 +1286,14 @@ static int setup_device(struct ib_device *device)
 
 static void disable_device(struct ib_device *device)
 {
+	struct rdma_dev_net *init_rdn, *rdn;
+	struct net *net;
 	u32 cid;
 
 	WARN_ON(!refcount_read(&device->refcount));
+	init_rdn = rdma_net_to_dev_net(&init_net);
+	net = read_pnet(&device->coredev.rdma_net);
+	rdn = rdma_net_to_dev_net(net);
 
 	down_write(&devices_rwsem);
 	xa_clear_mark(&devices, device->index, DEVICE_REGISTERED);
@@ -1277,12 +1305,21 @@ static void disable_device(struct ib_device *device)
 	 * clients can be added to this ib_device past this point we only need
 	 * the maximum possible client_id value here.
 	 */
-	down_read(&clients_rwsem);
-	cid = highest_client_id;
-	up_read(&clients_rwsem);
+	down_read(&init_rdn->rdn_clients_rwsem);
+	cid = init_rdn->rdn_highest_client_id;
+	up_read(&init_rdn->rdn_clients_rwsem);
 	while (cid) {
 		cid--;
-		remove_client_context(device, cid);
+		remove_client_context(device, cid, init_rdn, false);
+	}
+
+	rdn = rdma_net_to_dev_net(net);
+	down_read(&rdn->rdn_clients_rwsem);
+	cid = rdn->rdn_highest_client_id;
+	up_read(&rdn->rdn_clients_rwsem);
+	while (cid) {
+		cid--;
+		remove_client_context(device, cid, rdn, true);
 	}
 
 	/* Pairs with refcount_set in enable_device */
@@ -1297,6 +1334,26 @@ static void disable_device(struct ib_device *device)
 	remove_compat_devs(device);
 }
 
+static int add_net_client_context(struct rdma_dev_net *rdn,
+				  struct ib_device *device, bool net_client)
+{
+	struct ib_client *client;
+	unsigned long index;
+	int ret = 0;
+
+	down_read(&rdn->rdn_clients_rwsem);
+	xa_for_each_marked(&rdn->rdn_clients, index, client,
+			   CLIENT_REGISTERED) {
+		if (client->net_client == net_client)
+			ret = add_client_context(device, client, net_client);
+		if (ret)
+			break;
+	}
+	up_read(&rdn->rdn_clients_rwsem);
+
+	return ret;
+}
+
 /*
  * An enabled device is visible to all clients and to all the public facing
  * APIs that return a device pointer. This always returns with a new get, even
@@ -1304,8 +1361,8 @@ static void disable_device(struct ib_device *device)
  */
 static int enable_device_and_get(struct ib_device *device)
 {
-	struct ib_client *client;
-	unsigned long index;
+	struct rdma_dev_net *rdn;
+	struct net *net;
 	int ret = 0;
 
 	/*
@@ -1321,20 +1378,27 @@ static int enable_device_and_get(struct ib_device *device)
 	 * DEVICE_REGISTERED while we are completing the client setup.
 	 */
 	downgrade_write(&devices_rwsem);
-
 	if (device->ops.enable_driver) {
 		ret = device->ops.enable_driver(device);
 		if (ret)
 			goto out;
 	}
 
-	down_read(&clients_rwsem);
-	xa_for_each_marked (&clients, index, client, CLIENT_REGISTERED) {
-		ret = add_client_context(device, client);
-		if (ret)
-			break;
-	}
-	up_read(&clients_rwsem);
+	/* For backward compatibility, always add client context for all "old"
+	 * registered clients using ib_register_client().
+	 */
+	rdn = rdma_net_to_dev_net(&init_net);
+	ret = add_net_client_context(rdn, device, false);
+	if (ret)
+		goto out;
+
+	/* Now add client context for clients registered using
+	 * rdma_register_net_client().
+	 */
+	net = read_pnet(&device->coredev.rdma_net);
+	rdn = rdma_net_to_dev_net(net);
+	ret = add_net_client_context(rdn, device, true);
+
 	if (!ret)
 		ret = add_compat_devs(device);
 out:
@@ -1711,37 +1775,49 @@ static struct pernet_operations rdma_dev_net_ops = {
 	.size = sizeof(struct rdma_dev_net),
 };
 
-static int assign_client_id(struct ib_client *client)
+static int assign_client_id(struct net *net, struct ib_client *client,
+			    bool net_client)
 {
+	struct rdma_dev_net *rdn;
 	int ret;
 
-	down_write(&clients_rwsem);
+	rdn = rdma_net_to_dev_net(net);
+
+	down_write(&rdn->rdn_clients_rwsem);
+
 	/*
 	 * The add/remove callbacks must be called in FIFO/LIFO order. To
 	 * achieve this we assign client_ids so they are sorted in
 	 * registration order.
 	 */
-	client->client_id = highest_client_id;
-	ret = xa_insert(&clients, client->client_id, client, GFP_KERNEL);
+	client->client_id = rdn->rdn_highest_client_id;
+	ret = xa_insert(&rdn->rdn_clients, client->client_id, client,
+			GFP_KERNEL);
 	if (ret)
 		goto out;
 
-	highest_client_id++;
-	xa_set_mark(&clients, client->client_id, CLIENT_REGISTERED);
+	rdn->rdn_highest_client_id++;
+	xa_set_mark(&rdn->rdn_clients, client->client_id, CLIENT_REGISTERED);
+	client->net_client = net_client;
 
 out:
-	up_write(&clients_rwsem);
+	up_write(&rdn->rdn_clients_rwsem);
 	return ret;
 }
 
-static void remove_client_id(struct ib_client *client)
+static void remove_client_id(struct net *net, struct ib_client *client)
 {
-	down_write(&clients_rwsem);
-	xa_erase(&clients, client->client_id);
-	for (; highest_client_id; highest_client_id--)
-		if (xa_load(&clients, highest_client_id - 1))
+	struct rdma_dev_net *rdn;
+	struct xarray *clients;
+
+	rdn = rdma_net_to_dev_net(net);
+	clients = &rdn->rdn_clients;
+	down_write(&rdn->rdn_clients_rwsem);
+	xa_erase(clients, client->client_id);
+	for (; rdn->rdn_highest_client_id; rdn->rdn_highest_client_id--)
+		if (xa_load(clients, rdn->rdn_highest_client_id - 1))
 			break;
-	up_write(&clients_rwsem);
+	up_write(&rdn->rdn_clients_rwsem);
 }
 
 /**
@@ -1765,13 +1841,13 @@ int ib_register_client(struct ib_client *client)
 
 	refcount_set(&client->uses, 1);
 	init_completion(&client->uses_zero);
-	ret = assign_client_id(client);
+	ret = assign_client_id(&init_net, client, false);
 	if (ret)
 		return ret;
 
 	down_read(&devices_rwsem);
 	xa_for_each_marked (&devices, index, device, DEVICE_REGISTERED) {
-		ret = add_client_context(device, client);
+		ret = add_client_context(device, client, false);
 		if (ret) {
 			up_read(&devices_rwsem);
 			ib_unregister_client(client);
@@ -1783,6 +1859,34 @@ int ib_register_client(struct ib_client *client)
 }
 EXPORT_SYMBOL(ib_register_client);
 
+int rdma_register_net_client(struct net *net, struct ib_client *client)
+{
+	struct ib_device *device;
+	unsigned long index;
+	int ret;
+
+	refcount_set(&client->uses, 1);
+	init_completion(&client->uses_zero);
+	ret = assign_client_id(net, client, true);
+	if (ret)
+		return ret;
+
+	down_read(&devices_rwsem);
+	xa_for_each_marked (&devices, index, device, DEVICE_REGISTERED) {
+		if (!net_eq(net, read_pnet(&device->coredev.rdma_net)))
+			continue;
+		ret = add_client_context(device, client, true);
+		if (ret) {
+			up_read(&devices_rwsem);
+			rdma_unregister_net_client(net, client);
+			return ret;
+		}
+	}
+	up_read(&devices_rwsem);
+	return 0;
+}
+EXPORT_SYMBOL(rdma_register_net_client);
+
 /**
  * ib_unregister_client - Unregister an IB client
  * @client:Client to unregister
@@ -1797,12 +1901,14 @@ EXPORT_SYMBOL(ib_register_client);
 void ib_unregister_client(struct ib_client *client)
 {
 	struct ib_device *device;
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 
-	down_write(&clients_rwsem);
+	rdn = rdma_net_to_dev_net(&init_net);
+	down_write(&rdn->rdn_clients_rwsem);
 	ib_client_put(client);
-	xa_clear_mark(&clients, client->client_id, CLIENT_REGISTERED);
-	up_write(&clients_rwsem);
+	xa_clear_mark(&rdn->rdn_clients, client->client_id, CLIENT_REGISTERED);
+	up_write(&rdn->rdn_clients_rwsem);
 
 	/* We do not want to have locks while calling client->remove() */
 	rcu_read_lock();
@@ -1811,7 +1917,7 @@ void ib_unregister_client(struct ib_client *client)
 			continue;
 		rcu_read_unlock();
 
-		remove_client_context(device, client->client_id);
+		remove_client_context(device, client->client_id, rdn, false);
 
 		ib_device_put(device);
 		rcu_read_lock();
@@ -1823,19 +1929,58 @@ void ib_unregister_client(struct ib_client *client)
 	 * removal is ongoing. Wait until all removals are completed.
 	 */
 	wait_for_completion(&client->uses_zero);
-	remove_client_id(client);
+	remove_client_id(&init_net, client);
 }
 EXPORT_SYMBOL(ib_unregister_client);
 
+void rdma_unregister_net_client(struct net *net, struct ib_client *client)
+{
+	struct ib_device *device;
+	struct rdma_dev_net *rdn;
+	unsigned long index;
+
+	rdn = rdma_net_to_dev_net(net);
+	down_write(&rdn->rdn_clients_rwsem);
+	ib_client_put(client);
+	xa_clear_mark(&rdn->rdn_clients, client->client_id, CLIENT_REGISTERED);
+	up_write(&rdn->rdn_clients_rwsem);
+
+	/* We do not want to have locks while calling client->remove() */
+	rcu_read_lock();
+	xa_for_each (&devices, index, device) {
+		if (!ib_device_try_get(device))
+			continue;
+		rcu_read_unlock();
+
+		remove_client_context(device, client->client_id, rdn, true);
+
+		ib_device_put(device);
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+
+	/*
+	 * remove_client_context() is not a fence, it can return even though a
+	 * removal is ongoing. Wait until all removals are completed.
+	 */
+	wait_for_completion(&client->uses_zero);
+	remove_client_id(net, client);
+}
+EXPORT_SYMBOL(rdma_unregister_net_client);
+
 static int __ib_get_global_client_nl_info(const char *client_name,
 					  struct ib_client_nl_info *res)
 {
 	struct ib_client *client;
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 	int ret = -ENOENT;
 
-	down_read(&clients_rwsem);
-	xa_for_each_marked (&clients, index, client, CLIENT_REGISTERED) {
+	/* No network namespace info available... */
+	rdn = rdma_net_to_dev_net(&init_net);
+	down_read(&rdn->rdn_clients_rwsem);
+	xa_for_each_marked (&rdn->rdn_clients, index, client,
+			    CLIENT_REGISTERED) {
 		if (strcmp(client->name, client_name) != 0)
 			continue;
 		if (!client->get_global_nl_info) {
@@ -1849,7 +1994,7 @@ static int __ib_get_global_client_nl_info(const char *client_name,
 			get_device(res->cdev);
 		break;
 	}
-	up_read(&clients_rwsem);
+	up_read(&rdn->rdn_clients_rwsem);
 	return ret;
 }
 
@@ -1857,14 +2002,24 @@ static int __ib_get_client_nl_info(struct ib_device *ibdev,
 				   const char *client_name,
 				   struct ib_client_nl_info *res)
 {
+	struct xarray *cl_data, *cls;
+	struct rdma_dev_net *rdn;
 	unsigned long index;
 	void *client_data;
 	int ret = -ENOENT;
 
 	down_read(&ibdev->client_data_rwsem);
-	xan_for_each_marked (&ibdev->client_data, index, client_data,
+	if (ib_devices_shared_netns) {
+		rdn = rdma_net_to_dev_net(&init_net);
+		cl_data = &ibdev->client_data;
+	} else {
+		rdn = rdma_net_to_dev_net(read_pnet(&ibdev->coredev.rdma_net));
+		cl_data = &ibdev->net_client_data;
+	}
+	cls = &rdn->rdn_clients;
+	xan_for_each_marked (cl_data, index, client_data,
 			     CLIENT_DATA_REGISTERED) {
-		struct ib_client *client = xa_load(&clients, index);
+		struct ib_client *client = xa_load(cls, index);
 
 		if (!client || strcmp(client->name, client_name) != 0)
 			continue;
@@ -1939,13 +2094,17 @@ int ib_get_client_nl_info(struct ib_device *ibdev, const char *client_name,
 void ib_set_client_data(struct ib_device *device, struct ib_client *client,
 			void *data)
 {
+	struct xarray *cl_data;
 	void *rc;
 
 	if (WARN_ON(IS_ERR(data)))
 		data = NULL;
 
-	rc = xa_store(&device->client_data, client->client_id, data,
-		      GFP_KERNEL);
+	if (client->net_client)
+		cl_data = &device->net_client_data;
+	else
+		cl_data = &device->client_data;
+	rc = xa_store(cl_data, client->client_id, data, GFP_KERNEL);
 	WARN_ON(xa_is_err(rc));
 }
 EXPORT_SYMBOL(ib_set_client_data);
@@ -2523,20 +2682,27 @@ struct net_device *ib_get_net_dev_by_params(struct ib_device *dev,
 					    const struct sockaddr *addr)
 {
 	struct net_device *net_dev = NULL;
+	struct rdma_dev_net *init_rdn, *rdn;
 	unsigned long index;
 	void *client_data;
 
 	if (!rdma_protocol_ib(dev, port))
 		return NULL;
 
+	init_rdn = rdma_net_to_dev_net(&init_net);
+	rdn = rdma_net_to_dev_net(read_pnet(&dev->coredev.rdma_net));
 	/*
 	 * Holding the read side guarantees that the client will not become
 	 * unregistered while we are calling get_net_dev_by_params()
 	 */
 	down_read(&dev->client_data_rwsem);
+	/* First try all the non-net registered clients, and then the net
+	 * registered clients.
+	 */
 	xan_for_each_marked (&dev->client_data, index, client_data,
 			     CLIENT_DATA_REGISTERED) {
-		struct ib_client *client = xa_load(&clients, index);
+		struct ib_client *client = xa_load(&init_rdn->rdn_clients,
+						   index);
 
 		if (!client || !client->get_net_dev_by_params)
 			continue;
@@ -2546,6 +2712,22 @@ struct net_device *ib_get_net_dev_by_params(struct ib_device *dev,
 		if (net_dev)
 			break;
 	}
+	if (!net_dev) {
+		xan_for_each_marked(&dev->net_client_data, index, client_data,
+				     CLIENT_DATA_REGISTERED) {
+			struct ib_client *client = xa_load(&rdn->rdn_clients,
+							   index);
+
+			if (!client || !client->get_net_dev_by_params)
+				continue;
+
+			net_dev = client->get_net_dev_by_params(dev, port,
+								pkey, gid, addr,
+								client_data);
+			if (net_dev)
+				break;
+		}
+	}
 	up_read(&dev->client_data_rwsem);
 
 	return net_dev;
@@ -2749,6 +2931,12 @@ static int __init ib_core_init(void)
 
 	rdma_nl_init();
 
+	ret = register_pernet_device(&rdma_dev_net_ops);
+	if (ret) {
+		pr_warn("Couldn't init compat dev. ret %d\n", ret);
+		goto err_compat;
+	}
+
 	ret = addr_init();
 	if (ret) {
 		pr_warn("Couldn't init IB address resolution\n");
@@ -2773,12 +2961,6 @@ static int __init ib_core_init(void)
 		goto err_sa;
 	}
 
-	ret = register_pernet_device(&rdma_dev_net_ops);
-	if (ret) {
-		pr_warn("Couldn't init compat dev. ret %d\n", ret);
-		goto err_compat;
-	}
-
 	nldev_init();
 	rdma_nl_register(RDMA_NL_LS, ibnl_ls_cb_table);
 	roce_gid_mgmt_init();
@@ -2809,11 +2991,11 @@ static void __exit ib_core_cleanup(void)
 	roce_gid_mgmt_cleanup();
 	nldev_exit();
 	rdma_nl_unregister(RDMA_NL_LS);
-	unregister_pernet_device(&rdma_dev_net_ops);
 	unregister_blocking_lsm_notifier(&ibdev_lsm_nb);
 	ib_sa_cleanup();
 	ib_mad_cleanup();
 	addr_cleanup();
+	unregister_pernet_device(&rdma_dev_net_ops);
 	rdma_nl_exit();
 	class_unregister(&ib_class);
 	destroy_workqueue(ib_comp_unbound_wq);
@@ -2821,7 +3003,6 @@ static void __exit ib_core_cleanup(void)
 	/* Make sure that any pending umem accounting work is done. */
 	destroy_workqueue(ib_wq);
 	flush_workqueue(system_unbound_wq);
-	WARN_ON(!xa_empty(&clients));
 	WARN_ON(!xa_empty(&devices));
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c0b2fa7e9b95..1f3f497a870a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2729,6 +2729,9 @@ struct ib_device {
 	char iw_ifname[IFNAMSIZ];
 	u32 iw_driver_flags;
 	u32 lag_flags;
+
+	/* Also protected by client_data_rwsem */
+	struct xarray			net_client_data;
 };
 
 struct ib_client_nl_info;
@@ -2770,6 +2773,7 @@ struct ib_client {
 
 	/* kverbs are not required by the client */
 	u8 no_kverbs_req:1;
+	u8 net_client:1;
 };
 
 /*
@@ -2807,6 +2811,9 @@ void ib_unregister_device_queued(struct ib_device *ib_dev);
 int ib_register_client   (struct ib_client *client);
 void ib_unregister_client(struct ib_client *client);
 
+int rdma_register_net_client(struct net *net, struct ib_client *client);
+void rdma_unregister_net_client(struct net *net, struct ib_client *client);
+
 void __rdma_block_iter_start(struct ib_block_iter *biter,
 			     struct scatterlist *sglist,
 			     unsigned int nents,
@@ -2852,7 +2859,10 @@ rdma_block_iter_dma_address(struct ib_block_iter *biter)
 static inline void *ib_get_client_data(struct ib_device *device,
 				       struct ib_client *client)
 {
-	return xa_load(&device->client_data, client->client_id);
+	if (client->net_client) 
+		return xa_load(&device->net_client_data, client->client_id);
+	else
+		return xa_load(&device->client_data, client->client_id);
 }
 void  ib_set_client_data(struct ib_device *device, struct ib_client *client,
 			 void *data);

[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux