Re: [RFC] libibverbs IB device hotplug support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, 2017-02-27 at 18:08 +0200, Alex Rosenbaum wrote:
> 
> 
> Today the IB device list is returned by ibv_get_device_list() from
> libibverbs.
> 
> This IB device list is created by scanning of the
> /sys/class/infiniband_verbs
> 
> [2]. The list is cached and never updated, no matter if there were
> hardware
> 
> changes in the system.

Correct.  This is something I've felt we've needed to update/fix for a
while now.

> Detection of hotplug events is not part of libibverbs functionality
> and thus
> 
> not part of this RFC. User space applications should monitor
> respectful changes
> 
> in the system according to its specific logic to detect plugout or
> plugin of
> 
> hardware devices. This can be does by means of: netlink, udev or
> other inputs.

On this point I disagree.  I see no reason that we should implement
hotplug detection in every application when the library can get it
right just once and make it available to everyone.

> Here I suggest to modify the implementation of ibv_get_device_list()
> so that
> 
> consecutive calls will re-scan the sysfs in the same manner as done
> today in
> 
> order to create a fresh ibv_device list each time. We will remove
> caching of
> 
> devices that support plugout, while keeping the ibv_device cache for
> devices
> 
> which do not support plugout.

I would argue that a better solution is to, on first call into
ibv_get_device_list() (which doubles as the library's init() entry), we
scan the system for whatever devices there are, build a permanent list
of these devices, then register the necessary hooks ourselves to get
updates on both plug out and plug in, and on those events we update the
list as appropriate (on plug out, notify the application via async
event, after the application has processed the event, if the device is
unused, remove the device entirely, if the device is still in use, then
move it to a defunct list and wait for the application to be able to
release all of its references to it, and only then remove it; on plug
in, add a new device to the device list and notify the application via
async event at which point the application can decide if it wants to
use the device).  This might actually involve a new async notifier
added to the libibverbs API that is not tied to a device and which an
application would need to be modified to register with the library.
 But since any applicaiton wanting to support this is going to need to
be modified anyway, a new entry point isn't the end of the world.

> 
> Applications Behavior
> 
> -------------------------------------------------------------------
> -------------
> 
> Applications use different logic to decide which ibv_device is the
> relevant
> 
> device they want to use. And each application has its own detection
> logic to
> 
> track such changes in device availability.
> 
> Few examples: librdmacm logic is based on GUID values. Socket
> acceleration
> 
> (libvma) maps an IB device to its corresponding net iface based on
> netlink and
> 
> sysfs. DPDK applications lookup the IB device PCI address. And most
> MPI
> 
> implementation want human specified IB device name in command line
> and will
> 
> probably not handle any hotplug (out or in) events.
> 
> 
> 
> It is the application's responsibility to check which ibv_device
> returned from
> 
> ibv_get_device_list() has changed from previous scan and which is of
> interest.

With an async event notifier, and with a static list of devices, it
would be possible to simply pass the new device in with the notifier.
 Makes things much simpler for the app IMO.

> 
> 
> Verbs can issue an IBV_EVENT_DEVICE_FATAL async event on an open user
> space
> 
> ibv_context for device's which support the ib_device-
> >disassociate_ucontext().
> 
> This event will indicate to the application that the device is no
> longer
> 
> operational. In addition, user space CQ channel fd’s blocking calls
> on recv(),
> 
> select(), poll() or epoll() will be released with EINTR errno.
> 
> 
> 
> Typical user space application will monitor hardware changes and/or
> call for
> 
> ibv_get_device_list() only from control path dedicated thread, and
> not from the
> 
> fast path threads.
> 
> 
> 
> Pitfall
> 
> -------------------------------------------------------------------
> -------------
> 
> If a legacy user space application did not follow the
> ibv_get_device_list()
> 
> man page definition, and it saved a private copy of an ibv_device
> pointer and
> 
> used it after releasing the device list (call to
> ibv_free_device_list()), then
> 
> ibv_open_device() might seg-fault based on this new suggestion.
> 
> We can work around this by moving the IB device re-scan logic to a
> new API
> 
> 'ibv_refresh_device_list()' so that only new application using this
> API will
> 
> have correct behavior as needed.

Following what I wrote above, this would no longer be a problem.  The
call to ibv_get_device_list can simply return the existing list (on
first call we init this list), and we can make ibv_free_device_list a
no-op.  The async events can notify of adds/removes, user can update
their own list accordingly, but we would also protect ourselves in all
of the ibv_ calls such that if the user space code calls into our code
with a defunct device we don't crash.  We might want to protect all of
our calls, even hot path calls, from invalid ibv_context pointers now
as a result of this support.


> 
> 
> Reference
> 
> -------------------------------------------------------------------
> -------------
> 
> [1] https://www.kernel.org/doc/pending/hotplug.txt
> 
> [2] https://github.com/linux-rdma/rdma-core/blob/master/Documentation
> /libibverbs.md
> 
> 
> 
> API changes
> 
> -------------------------------------------------------------------
> -------------
> 
> Signed-off-by: Alex Rosenbaum <alexr@xxxxxxxxxxxx>
> 
> --- a/libibverbs/verbs.h
> 
> +++ b/libibverbs/verbs.h
> 
> @@ -1336,6 +1336,7 @@ struct verbs_device {
> 
>                                 struct ibv_context *ctx, int cmd_fd);
> 
>         void    (*uninit_context)(struct verbs_device *device,
> 
>                                 struct ibv_context *ctx);
> 
> +       void    (*uninit_device)(struct verbs_device *device);
> 
> +       atomic_t         ref_count;
> 
>         /* future fields added here */
> 
> };
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" 
> in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Doug Ledford <dledford@xxxxxxxxxx>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

Attachment: signature.asc
Description: This is a digitally signed message part


[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux