On Mon, 2017-02-27 at 18:08 +0200, Alex Rosenbaum wrote: > > > Today the IB device list is returned by ibv_get_device_list() from > libibverbs. > > This IB device list is created by scanning of the > /sys/class/infiniband_verbs > > [2]. The list is cached and never updated, no matter if there were > hardware > > changes in the system. Correct. This is something I've felt we've needed to update/fix for a while now. > Detection of hotplug events is not part of libibverbs functionality > and thus > > not part of this RFC. User space applications should monitor > respectful changes > > in the system according to its specific logic to detect plugout or > plugin of > > hardware devices. This can be does by means of: netlink, udev or > other inputs. On this point I disagree. I see no reason that we should implement hotplug detection in every application when the library can get it right just once and make it available to everyone. > Here I suggest to modify the implementation of ibv_get_device_list() > so that > > consecutive calls will re-scan the sysfs in the same manner as done > today in > > order to create a fresh ibv_device list each time. We will remove > caching of > > devices that support plugout, while keeping the ibv_device cache for > devices > > which do not support plugout. I would argue that a better solution is to, on first call into ibv_get_device_list() (which doubles as the library's init() entry), we scan the system for whatever devices there are, build a permanent list of these devices, then register the necessary hooks ourselves to get updates on both plug out and plug in, and on those events we update the list as appropriate (on plug out, notify the application via async event, after the application has processed the event, if the device is unused, remove the device entirely, if the device is still in use, then move it to a defunct list and wait for the application to be able to release all of its references to it, and only then remove it; on plug in, add a new device to the device list and notify the application via async event at which point the application can decide if it wants to use the device). This might actually involve a new async notifier added to the libibverbs API that is not tied to a device and which an application would need to be modified to register with the library. But since any applicaiton wanting to support this is going to need to be modified anyway, a new entry point isn't the end of the world. > > Applications Behavior > > ------------------------------------------------------------------- > ------------- > > Applications use different logic to decide which ibv_device is the > relevant > > device they want to use. And each application has its own detection > logic to > > track such changes in device availability. > > Few examples: librdmacm logic is based on GUID values. Socket > acceleration > > (libvma) maps an IB device to its corresponding net iface based on > netlink and > > sysfs. DPDK applications lookup the IB device PCI address. And most > MPI > > implementation want human specified IB device name in command line > and will > > probably not handle any hotplug (out or in) events. > > > > It is the application's responsibility to check which ibv_device > returned from > > ibv_get_device_list() has changed from previous scan and which is of > interest. With an async event notifier, and with a static list of devices, it would be possible to simply pass the new device in with the notifier. Makes things much simpler for the app IMO. > > > Verbs can issue an IBV_EVENT_DEVICE_FATAL async event on an open user > space > > ibv_context for device's which support the ib_device- > >disassociate_ucontext(). > > This event will indicate to the application that the device is no > longer > > operational. In addition, user space CQ channel fd’s blocking calls > on recv(), > > select(), poll() or epoll() will be released with EINTR errno. > > > > Typical user space application will monitor hardware changes and/or > call for > > ibv_get_device_list() only from control path dedicated thread, and > not from the > > fast path threads. > > > > Pitfall > > ------------------------------------------------------------------- > ------------- > > If a legacy user space application did not follow the > ibv_get_device_list() > > man page definition, and it saved a private copy of an ibv_device > pointer and > > used it after releasing the device list (call to > ibv_free_device_list()), then > > ibv_open_device() might seg-fault based on this new suggestion. > > We can work around this by moving the IB device re-scan logic to a > new API > > 'ibv_refresh_device_list()' so that only new application using this > API will > > have correct behavior as needed. Following what I wrote above, this would no longer be a problem. The call to ibv_get_device_list can simply return the existing list (on first call we init this list), and we can make ibv_free_device_list a no-op. The async events can notify of adds/removes, user can update their own list accordingly, but we would also protect ourselves in all of the ibv_ calls such that if the user space code calls into our code with a defunct device we don't crash. We might want to protect all of our calls, even hot path calls, from invalid ibv_context pointers now as a result of this support. > > > Reference > > ------------------------------------------------------------------- > ------------- > > [1] https://www.kernel.org/doc/pending/hotplug.txt > > [2] https://github.com/linux-rdma/rdma-core/blob/master/Documentation > /libibverbs.md > > > > API changes > > ------------------------------------------------------------------- > ------------- > > Signed-off-by: Alex Rosenbaum <alexr@xxxxxxxxxxxx> > > --- a/libibverbs/verbs.h > > +++ b/libibverbs/verbs.h > > @@ -1336,6 +1336,7 @@ struct verbs_device { > > struct ibv_context *ctx, int cmd_fd); > > void (*uninit_context)(struct verbs_device *device, > > struct ibv_context *ctx); > > + void (*uninit_device)(struct verbs_device *device); > > + atomic_t ref_count; > > /* future fields added here */ > > }; > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Doug Ledford <dledford@xxxxxxxxxx> GPG KeyID: B826A3330E572FDD Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
Attachment:
signature.asc
Description: This is a digitally signed message part