[PATCH rdma-next 7/8] RDMA/core: Add Documentation for ib_core_device

From: Parav Pandit <parav@xxxxxxxxxxxx>

Describe the association between ib_core_device and ib_device, their
existence in net namespaces for backward compatibility, and the locking
scheme.

Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
---
 Documentation/infiniband/core_devices.txt | 146 ++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 Documentation/infiniband/core_devices.txt

diff --git a/Documentation/infiniband/core_devices.txt b/Documentation/infiniband/core_devices.txt
new file mode 100644
index 000000000000..34f7d5cea54f
--- /dev/null
+++ b/Documentation/infiniband/core_devices.txt
@@ -0,0 +1,146 @@
+Linux RDMA devices and their sysfs entries
+------------------------------------------
+
+1. Background
+--------------
+RDMA networking devices support at least 3 link or transport layers:
+(a) InfiniBand
+(b) RoCE
+(c) iWARP
+
+These networking devices provide kernel bypass for sending/receiving
+data to/from the network.
+
+These devices are used in various modes, along with other protocols,
+for connection establishment and/or for data transfer.
+For example:
+(a) rdmacm for connection establishment and verbs for data transfer.
+(b) TCP/IP for connection establishment and verbs for data transfer.
+
+Additionally, rdma devices can be shared among multiple net namespaces.
+
+As the stack matures, it is also desirable to have rdma devices that
+belong to a single net namespace.
+
+sysfs entries are heavily used in the rdma stack for device discovery,
+statistics and network addresses.
+
+Therefore, to minimize the impact on backward compatibility for these 3
+transports and to provide a forward-looking method, the following sysfs
+isolation approach is taken.
+
+2. Design
+----------
+
+For every rdma ib_device, core code creates an ib_core_device in every
+net namespace to give the appearance that the rdma device is present
+in all net namespaces.
+Each ib_core_device owns the sysfs entries in its net namespace.
+
+All ib_core_devices point to one owner ib_device through their owner pointer.
+
+2.1 Shared rdma ib_device view in different net namespaces
+-----------------------------------------------------------
+
+  ib_core_device (net_ns_1)
+  +--------------+
+  |              |
+  | device       |
+  | +----------+ |
+  | |          | |
+  | |          | |
+  | |          | |
+  | +----------+ |                         (init_net)
+  | *net         |                         ib_device
+  | *owner-------------------------+------>+--------------------+<--+
+  +--------------+                 |       |                    |   |
+                                   |       |  ib_core_device    |   |
+                                   |       |  +--------------+  |   |
+                                   |       |  |              |  |   |
+                                   |       |  | device       |  |   |
+                                   |       |  | +----------+ |  |   |
+   ib_core_device (net_ns_2)       |       |  | |          | |  |   |
+   +--------------+                |       |  | |          | |  |   |
+   |              |                |       |  | |          | |  |   |
+   | device       |                |       |  | +----------+ |  |   |
+   | +----------+ |                |       |  | *net         |  |   |
+   | |          | |                |       |  | *owner--------------+
+   | |          | |                |       |  +--------------+  |
+   | |          | |                |       +--------------------+
+   | +----------+ |                |
+   | *net         |                |
+   | *owner------------------------+
+   +--------------+
+
+2.2 rdma ib_device bound to a net namespace (in future)
+--------------------------------------------------------
+
+In this mode, when an rdma device is bound to a net namespace, all compat
+sysfs entries will be removed. The sysfs entries will reside only in the
+net namespace to which the device is bound.
+This gives a one-to-one mapping and isolates each device to its owning
+net namespace.
+
+(net_ns_1)
+ib_device
++--------------------+
+|                    |
+|                    |
+|  ib_core_device    |
+|  +--------------+  |
+|  |              |  |
+|  | device       |  |
+|  | +----------+ |  |
+|  | |          | |  |
+|  | |          | |  |
+|  | |          | |  |
+|  | +----------+ |  |
+|  |              |  |
+|  | *net         |  |
+|  | *owner       |  |
+|  +--------------+  |
++--------------------+
+
+2.3 Locking scheme
+-------------------
+There are three locks involved in providing synchronization between five
+operations.
+These five operations are:
+(a) device addition using ib_register_device()
+(b) device removal using ib_unregister_device()
+(c) net namespace addition using _init_net() notifier
+(d) net namespace removal using _exit_net() notifier
+(e) device renaming using a netlink command
+
+Each of the above operations can happen in parallel.
+A few interesting combinations to consider are:
+1. init_net() and register_device() trying to add compat devices
+2. exit_net() and unregister_device() trying to remove compat devices
+3. renaming compat devices while doing init_net() or exit_net()
+
+Net namespaces are identified using a unique id in an xarray.
+Operations on this xarray are protected by rdma_net_rwsem.
+The same id is used when adding a compat device for a given rdma device.
+
+The compat devices of a given ib device are maintained in a per device xarray.
+This xarray is used because two paths, the net ns notifiers and the device
+life cycle routines, both attempt to add compat devices. This work is
+protected by the per device compat_rw_mutex.
+
+The lock sequence below ensures that whichever path sees the device
+adds/removes the compat devices for the given net namespace(s).
+
+    cpu-0                          cpu-1
+    -----                          -----
+init_net()/exit_net()          reg_dev()/unreg_dev()
+
+    lock_N                          lock_D
+    [..]                            [..]
+    unlock_N                        [..]
+                                    unlock_D
+
+                                    lock_N
+                                    [..]
+     lock_D                         unlock_N
+     [..]
+     unlock_D
-- 
2.19.1



