Introduction
--------------------------------------------------------------------------------
Hotplug [1] is the mechanism by which new hardware becomes available in the
system or is removed from it. User space applications would like to continue
operating while hardware in the system changes, without having to restart the
process, lose its current state, or lose open sessions (at least those on
other hardware that is still available).

The problem
--------------------------------------------------------------------------------
Today the IB device list is returned by ibv_get_device_list() from libibverbs.
This list is created by scanning /sys/class/infiniband_verbs [2]. The list is
cached and never updated, regardless of any hardware changes in the system.

Detection of hotplug events is not part of libibverbs functionality and thus
not part of this RFC. User space applications should monitor the relevant
changes in the system, according to their own logic, to detect plugout or
plugin of hardware devices. This can be done by means of netlink, udev or
other inputs.

Suggested Solution
--------------------------------------------------------------------------------
In order for user space applications to support hotplug of IB devices (PlugIn
and PlugOut), libibverbs must be able to give the application access to new
ibv_device objects that reflect the recent hardware changes in the system.

I suggest modifying the implementation of ibv_get_device_list() so that
consecutive calls re-scan sysfs in the same manner as is done today, creating
a fresh ibv_device list each time. We will remove caching of devices that
support plugout, while keeping the ibv_device cache for devices which do not
support plugout. For this purpose, the ibv_get_device_list() device scanning
logic should be separated from the libibverbs singleton initialization step.

The user can call ibv_open_device() while holding this list (see man pages),
and once ibv_free_device_list() is called libibverbs can release the unused
ibv_device objects. Later, on calls to ibv_close_device(), the remaining
ibv_device objects should be released. Currently, on ibv_free_device_list(),
only the array is freed, while the ibv_device objects are never freed.

libibverbs will maintain a ref_count for each verbs_device object:
verbs_device->ref_count is increased on every ibv_get_device_list() or
ibv_open_device(), and decreased on every ibv_free_device_list() or
ibv_close_device(). On a decrease, if ref_count reaches zero, libibverbs will
call the provider library to release the 'struct verbs_device' which it
allocated. Each provider library should provide a function to release the
verbs_device object: 'uninit_device(struct verbs_device *device)'.

In order to prevent a resource leak for provider libraries that do not support
the plugout API, libibverbs will move the relevant ibv_device objects to a
cached device list which will never be refreshed (like today), and will also
remove the respective provider library (ibv_driver) from the registered driver
list. Removing the ibv_driver makes sure that future scans of sysfs will not
generate additional copies of the same ibv_device.

Applications Behavior
--------------------------------------------------------------------------------
Applications use different logic to decide which ibv_device is the relevant
device they want to use, and each application has its own detection logic to
track changes in device availability. A few examples: librdmacm logic is based
on GUID values. Socket acceleration (libvma) maps an IB device to its
corresponding net iface based on netlink and sysfs. DPDK applications look up
the IB device PCI address. Most MPI implementations want a human-specified IB
device name on the command line and will probably not handle any hotplug (out
or in) events.

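As one illustration of such application-side logic, a GUID-keyed application
(in the spirit of the librdmacm example above) could compare the GUIDs seen in
two consecutive scans. The sketch below is not part of this RFC or of
libibverbs: guid_in_set() and guid_difference() are hypothetical helpers, and
the GUID arrays are assumed to have been filled by calling
ibv_get_device_guid() on each entry of the list returned by
ibv_get_device_list().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if guid occurs in set[0..n). */
static bool guid_in_set(uint64_t guid, const uint64_t *set, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == guid)
            return true;
    return false;
}

/*
 * Hypothetical helper: copy into out[] every GUID that is present in a[]
 * but absent from b[]. out must have room for n_a entries.
 * Returns the number of entries written to out[].
 */
static size_t guid_difference(const uint64_t *a, size_t n_a,
                              const uint64_t *b, size_t n_b,
                              uint64_t *out)
{
    size_t count = 0;

    assert(out != NULL);
    for (size_t i = 0; i < n_a; i++)
        if (!guid_in_set(a[i], b, n_b))
            out[count++] = a[i];
    return count;
}
```

Calling guid_difference(new, n_new, old, n_old, out) yields the devices that
were plugged in since the previous scan; swapping the two arrays yields the
devices that were plugged out. The quadratic comparison is assumed to be
acceptable because device lists are short and the scan runs on the control
path.
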
It is the application's responsibility to check which ibv_device returned from
ibv_get_device_list() has changed since the previous scan and which is of
interest.

Verbs can issue an IBV_EVENT_DEVICE_FATAL async event on an open user space
ibv_context for devices which support ib_device->disassociate_ucontext().
This event indicates to the application that the device is no longer
operational. In addition, blocking calls to recv(), select(), poll() or
epoll() on user space CQ channel fds will be released with errno EINTR.

A typical user space application will monitor hardware changes and/or call
ibv_get_device_list() only from a dedicated control path thread, and not from
the fast path threads.

Pitfall
--------------------------------------------------------------------------------
If a legacy user space application did not follow the ibv_get_device_list()
man page definition, saved a private copy of an ibv_device pointer, and used
it after releasing the device list (by calling ibv_free_device_list()), then
ibv_open_device() might segfault under this new suggestion.

We can work around this by moving the IB device re-scan logic to a new API,
'ibv_refresh_device_list()', so that only new applications using this API get
the new behavior.

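To make the lifetime rules behind this pitfall concrete, here is a small
standalone model of the proposed reference counting. It is only a sketch:
struct model_device and the model_*() helpers are hypothetical names that
stand in for struct verbs_device and the libibverbs entry points, C11
atomic_int is assumed in place of atomic_t, and no real provider code is
involved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Model only: stands in for struct verbs_device with the two proposed fields. */
struct model_device {
    atomic_int ref_count;
    void (*uninit_device)(struct model_device *dev);
    bool *released;  /* illustration hook: set when the object is freed */
};

/* Stand-in for a provider's uninit_device(): record the release, then free. */
static void model_uninit(struct model_device *dev)
{
    *dev->released = true;
    free(dev);
}

/* Allocate a model device with ref_count == 0; returns NULL on failure. */
static struct model_device *model_device_new(bool *released)
{
    struct model_device *dev = malloc(sizeof(*dev));

    if (!dev)
        return NULL;
    atomic_init(&dev->ref_count, 0);
    dev->uninit_device = model_uninit;
    dev->released = released;
    *released = false;
    return dev;
}

/* Reference taken by ibv_get_device_list() and ibv_open_device() in the RFC. */
static void model_get(struct model_device *dev)
{
    atomic_fetch_add(&dev->ref_count, 1);
}

/* Reference dropped by ibv_free_device_list() and ibv_close_device(). */
static void model_put(struct model_device *dev)
{
    /* atomic_fetch_sub() returns the previous value: 1 means last reference. */
    if (atomic_fetch_sub(&dev->ref_count, 1) == 1)
        dev->uninit_device(dev);
}
```

In this model the sequence get (list), get (open), put (free list), put
(close) releases the object only on the final put. The pitfall corresponds to
get (list), put (free list), followed by an open on the saved pointer: the
object was already released by the first put, so the pointer is dangling.
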
Reference
--------------------------------------------------------------------------------
[1] https://www.kernel.org/doc/pending/hotplug.txt
[2] https://github.com/linux-rdma/rdma-core/blob/master/Documentation/libibverbs.md

API changes
--------------------------------------------------------------------------------

Signed-off-by: Alex Rosenbaum <alexr@xxxxxxxxxxxx>

--- a/libibverbs/verbs.h
+++ b/libibverbs/verbs.h
@@ -1336,6 +1336,7 @@ struct verbs_device {
                                 struct ibv_context *ctx, int cmd_fd);
         void    (*uninit_context)(struct verbs_device *device,
                                 struct ibv_context *ctx);
+        void    (*uninit_device)(struct verbs_device *device);
+        atomic_t ref_count;
         /* future fields added here */
 };