From: Parav Pandit <parav@xxxxxxxxxxxx>

Describe the association between ib_core_device and ib_device, their
existence in net namespaces for backward compatibility, and the locking
scheme.

Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
---
 Documentation/infiniband/core_devices.txt | 146 ++++++++++++++++++++++
 1 file changed, 146 insertions(+)
 create mode 100644 Documentation/infiniband/core_devices.txt

diff --git a/Documentation/infiniband/core_devices.txt b/Documentation/infiniband/core_devices.txt
new file mode 100644
index 000000000000..34f7d5cea54f
--- /dev/null
+++ b/Documentation/infiniband/core_devices.txt
@@ -0,0 +1,146 @@
+Linux RDMA devices and their sysfs entries
+------------------------------------------
+
+1. Background
+--------------
+RDMA networking devices have at least 3 link or transport layers.
+(a) InfiniBand
+(b) RoCE
+(c) iWARP
+
+These networking devices provide kernel bypass for sending/receiving
+data to/from the network.
+
+There are various modes in which these devices are used along with
+other protocols for connection establishment and/or for data transfer.
+For example,
+(a) rdmacm for connection establishment and verbs for data transfer.
+(b) tcp/ip for connection establishment and verbs for data transfer.
+
+Additionally rdma devices can be shared among multiple net namespaces.
+
+It is also desired to have per net namespace rdma devices as the
+stack matures.
+
+sysfs entries are heavily used for device discovery, statistics and network
+addresses in the rdma stack.
+
+Therefore, to have minimal impact on backward compatibility for these 3
+transports and to provide a forward looking method, the following sysfs
+isolation approach is taken.
+
+2. Design
+----------
+
+For every rdma ib_device, core code creates an ib_core_device in every
+net namespace to give the appearance that the rdma device is present
+in all net namespaces.
+Each ib_core_device owns the sysfs entries in its net namespace.
+
+All ib_core_device(s) point to one owner ib_device using the owner pointer.
+
+2.1 Shared rdma ib_device view in different net namespaces
+-----------------------------------------------------------
+
+    ib_core_device (net_ns_1)
+    +--------------+
+    |              |
+    |    device    |
+    | +----------+ |
+    | |          | |
+    | |          | |
+    | |          | |
+    | +----------+ |                                  (init_net)
+    |    *net      |                                  ib_device
+    |    *owner-------------------------+------>+--------------------+<--+
+    +--------------+                    |       |                    |   |
+                                        |       |   ib_core_device   |   |
+                                        |       |  +--------------+  |   |
+                                        |       |  |              |  |   |
+                                        |       |  |    device    |  |   |
+                                        |       |  | +----------+ |  |   |
+    ib_core_device (net_ns_2)           |       |  | |          | |  |   |
+    +--------------+                    |       |  | |          | |  |   |
+    |              |                    |       |  | |          | |  |   |
+    |    device    |                    |       |  | +----------+ |  |   |
+    | +----------+ |                    |       |  |    *net      |  |   |
+    | |          | |                    |       |  |    *owner-----------+
+    | |          | |                    |       |  +--------------+  |
+    | |          | |                    |       +--------------------+
+    | +----------+ |                    |
+    |    *net      |                    |
+    |    *owner-------------------------+
+    +--------------+
+
+2.2 rdma ib_device bound to a net namespace (in future)
+--------------------------------------------------------
+
+In this mode, when an rdma device is bound to a net namespace, all compat
+sysfs entries will be removed. sysfs entries will reside in the single
+net namespace to which the device is bound.
+This gives a one-to-one mapping and isolates devices
+to their owning net namespace.
+
+(net_ns_1)
+ib_device
++--------------------+
+|                    |
+|                    |
+|   ib_core_device   |
+|  +--------------+  |
+|  |              |  |
+|  |    device    |  |
+|  | +----------+ |  |
+|  | |          | |  |
+|  | |          | |  |
+|  | |          | |  |
+|  | +----------+ |  |
+|  |              |  |
+|  |    *net      |  |
+|  |    *owner    |  |
+|  +--------------+  |
++--------------------+
+
+2.3 locking scheme
+-------------------
+There are three locks involved to provide synchronization between five
+operations.
+These five operations are
+(a) device addition using ib_register_device()
+(b) device removal using ib_unregister_device()
+(c) net namespace addition using _init_net() notifier
+(d) net namespace removal using _exit_net() notifier
+(e) device renaming netlink command
+
+Each of the above operations can happen in parallel.
+A few interesting combinations to consider are:
+1. init_net() and register_device() trying to add compat devices
+2. exit_net() and unregister_device() trying to remove compat devices
+3. renaming compat devices while doing init_net() or exit_net().
+
+Net namespaces are identified using a unique id in an xarray.
+This xarray operation is protected using rdma_net_rwsem.
+The same id is used for adding a compat device for a given rdma device.
+
+Compat devices of a given ib device are maintained using a per device xarray.
+This xarray is used because two paths, net ns notifiers and device life cycle
+routines, both attempt to add compat devices. Such work is protected using the
+per device compat_rw_mutex.
+
+The lock sequence below ensures that whoever sees the device adds/removes
+compat devices for the given net namespace(s).
+
+   cpu-0                     cpu-1
+   -----                     -----
+init_net()/exit_net()    reg_dev()/unreg_dev()
+
+ lock_N                     lock_D
+  [..]                       [..]
+ unlock_N                    [..]
+                            unlock_D
+
+                            lock_N
+                             [..]
+ lock_D                     unlock_N
+  [..]
+ unlock_D
-- 
2.19.1
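
The lock sequence in section 2.3 can be modelled in userspace. What follows
is a minimal sketch, not the kernel implementation: it uses pthread mutexes
in place of the kernel locks, fixed-size arrays in place of the xarrays, and
covers only the add paths. All names in it (lock_n, lock_d, struct model_dev,
model_setup(), model_init_net(), model_register_device(), model_race(),
MAX_NETS, MAX_DEVS) are hypothetical. Reading lock_D as a lock over device
registration state is an assumption drawn from the diagram. Neither path
holds lock_N and lock_D at the same time, and each path publishes its own
object before walking the other side, so at least one of the two paths sees
the other's object and adds the compat device. The per device lock makes a
double add harmless.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_NETS 4 /* hypothetical sizes, chosen only for this sketch */
#define MAX_DEVS 2

/* lock_N and lock_D from the diagram; the variable names are illustrative. */
static pthread_mutex_t lock_n = PTHREAD_MUTEX_INITIALIZER; /* guards net_present[] */
static pthread_mutex_t lock_d = PTHREAD_MUTEX_INITIALIZER; /* guards registered flags */

struct model_dev {
	pthread_mutex_t compat_lock; /* stands in for the per device compat lock */
	bool registered;
	bool compat[MAX_NETS];       /* stands in for the per device xarray */
	int adds;                    /* compat devices actually created */
};

static bool net_present[MAX_NETS];
static struct model_dev devs[MAX_DEVS];

/* Reset the model; call once before any other function. */
void model_setup(void)
{
	for (int d = 0; d < MAX_DEVS; d++) {
		pthread_mutex_init(&devs[d].compat_lock, NULL);
		devs[d].registered = false;
		devs[d].adds = 0;
		for (int n = 0; n < MAX_NETS; n++)
			devs[d].compat[n] = false;
	}
	for (int n = 0; n < MAX_NETS; n++)
		net_present[n] = false;
}

/* Idempotent: both paths may call this for the same (dev, net) pair. */
static void add_compat(struct model_dev *dev, int net)
{
	assert(net >= 0 && net < MAX_NETS);
	pthread_mutex_lock(&dev->compat_lock);
	if (!dev->compat[net]) {
		dev->compat[net] = true;
		dev->adds++;
	}
	pthread_mutex_unlock(&dev->compat_lock);
}

/* cpu-0 column: publish the net under lock_N, then walk devices under lock_D. */
void model_init_net(int net)
{
	pthread_mutex_lock(&lock_n);
	net_present[net] = true;
	pthread_mutex_unlock(&lock_n);

	pthread_mutex_lock(&lock_d);
	for (int d = 0; d < MAX_DEVS; d++)
		if (devs[d].registered)
			add_compat(&devs[d], net);
	pthread_mutex_unlock(&lock_d);
}

/* cpu-1 column: publish the device under lock_D, then walk nets under lock_N. */
void model_register_device(int dev)
{
	pthread_mutex_lock(&lock_d);
	devs[dev].registered = true;
	pthread_mutex_unlock(&lock_d);

	pthread_mutex_lock(&lock_n);
	for (int n = 0; n < MAX_NETS; n++)
		if (net_present[n])
			add_compat(&devs[dev], n);
	pthread_mutex_unlock(&lock_n);
}

static void *net_thread(void *p) { model_init_net(*(int *)p); return NULL; }
static void *dev_thread(void *p) { model_register_device(*(int *)p); return NULL; }

/* Run both paths concurrently; returns 0 if both threads ran to completion. */
int model_race(int net, int dev)
{
	pthread_t a, b;

	if (pthread_create(&a, NULL, net_thread, &net))
		return -1;
	if (pthread_create(&b, NULL, dev_thread, &dev)) {
		pthread_join(a, NULL);
		return -1;
	}
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

/* Number of compat devices created so far for one device. */
int model_adds(int dev)
{
	pthread_mutex_lock(&devs[dev].compat_lock);
	int n = devs[dev].adds;
	pthread_mutex_unlock(&devs[dev].compat_lock);
	return n;
}
```

After model_setup(), a call to model_race(1, 0) should leave exactly one
compat entry for device 0 in net 1 whichever thread runs first, because the
path that runs second always observes what the first one published.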