On 12/05/2020 13:51, Alex Rosenbaum wrote:
> On Tue, May 12, 2020 at 11:24 AM Gal Pressman <galpress@xxxxxxxxxx> wrote:
>>
>> On 11/05/2020 18:35, Yishai Hadas wrote:
>>> On 5/11/2020 5:31 PM, Gal Pressman wrote:
>>>> On 11/05/2020 16:12, Yishai Hadas wrote:
>>>>> Introduce import verbs for device, PD, MR, it enables processes to share
>>>>> their ibv_context and then share PD and MR that is associated with.
>>>>>
>>>>> A process is creating a device and then uses some of the Linux system
>>>>> calls to dup its 'cmd_fd' member which lets other process to obtain
>>>>> owning on.
>>>>>
>>>>> Once other process obtains the 'cmd_fd' it can call ibv_import_device()
>>>>> which returns an ibv_context on the original RDMA device.
>>>>>
>>>>> On the imported device there is an option to import PD(s) and MR(s) to
>>>>> achieve a sharing on those objects.
>>>>>
>>>>> This is the responsibility of the application to coordinate between all
>>>>> ibv_context(s) that use the imported objects, such that once destroy is
>>>>> done no other process can touch the object except for unimport. All
>>>>> users of the context must collaborate to ensure this.
>>>>>
>>>>> A matching unimport verbs were introduced for PD and MR, for the device
>>>>> the ibv_close_device() API should be used.
>>>>>
>>>>> Detailed man pages are introduced as part of this RFC patch to clarify
>>>>> the expected usage and notes.
>>>>>
>>>>> Signed-off-by: Yishai Hadas <yishaih@xxxxxxxxxxxx>
>>>>
>>>> Hi Yishai,
>>>>
>>>> A few questions:
>>>> Can you please explain the use case? I remember there was a discussion on the
>>>> previous shared PD kernel submission (by Yuval and Shamir) but I'm not sure if
>>>> there was a conclusion.
>>>>
>>>
>>> The expected flow and use case are as follows.
>>>
>>> One process creates an ibv_context by calling ibv_open_device() and then enables
>>> owning of its 'cmd_fd' with other processes by some Linux system call, (see man
>>> page as part of this RFC for some alternatives). Then other process that owns
>>> this 'cmd_fd' will be able to have its own ibv_context for the same RDMA device
>>> by calling ibv_import_device().
>>>
>>> At that point those processes really work on same kernel context and PD(s),
>>> MR(s) and potentially other objects in the future can be shared by calling
>>> ibv_import_pd()/mr() assuming that the initiator process lets the other ones
>>> know the kernel handle value.
>>>
>>> Once a PD and MR which points to this PD were shared it enables a memory that
>>> was registered by one process to be used by others with the matching lkey/rkey
>>> for RDMA operations.
>>
>> Thanks Yishai.
>> Which type of applications need this kind of functionality?
>
> Any solution which is a single business logic based on multi-process
> design needs this.
> Examples include NGINX, with TCP load balancing, sharing the RSS
> indirection table with RQ per process.
> HPC frameworks with multi-rank(process) solution on single hosts. UCX
> can share IB resources using the shared PD and can help dispatch data
> to multiple processes/MR's in single RDMA operation.
> Also, we have solutions in which the primary processes registered a
> large shared memory range, and each worker process spawned will create
> a private QP on the shared PD, and use the shared MR to save the
> registration time per-process.
>
>>
>>>> Could you please elaborate more how the process cleanup flow (e.g. killed
>>>> process) is going to change? I know it's a very broad question but I'm just
>>>> trying to get the general idea.
>>>>
>>>
>>> For now the model in those suggested APIs is that cleanup will be done either
>>> explicitly by calling the relevant destroy command or alternatively once all
>>> processes that own the cmd_fd will be closed.
>>>
>>> From kernel side there is only one object and its ref count is not increased as
>>> part of the import_xxx() functions, see in the man pages some notes regarding
>>> this point.
>>
>> ACK.
>>
>>>> What's expected to happen in a case where we have two processes P1 & P2, both
>>>> use a shared PD, but separate MRs and QPs (created under the same shared PD).
>>>> Now when an RDMA read request arrives at P2's QP, but refers to an MR of P1
>>>> (which was not imported, but under the same PD), how would you expect the device
>>>> to handle that?
>>>>
>>>
>>> The processes are behaving almost like 2 threads each have a QP and an MR, if
>>> you mix them around it will work just like any buggy software.
>>> In this case I would expect the device to scatter to the MR that was pointed by
>>> the RDMA read request, any reason that it will behave differently?
>>
>> I meant that the process is the RDMA read responder, not requester (although
>> it's very similar), are we OK with one process accessing memory of a different
>> process even though the MR isn't exported?
>>
>> I'm wondering whether there are any assumptions about the "security" model of
>> this feature, or are both processes considered exactly the same. Especially
>> since both the kernel and the device aren't aware of the shared resources.
>
> The RDMA security model is bound to the protection domain, so once the
> application logic shared its PD (via the 'handle') it extended
> the security scope.
>
>> It's a bit confusing that some of the resources are shared while others aren't
>> though all created using the same PD.
>
> In this RFC, the shared resources are only stateless resources. Just
> import the resource, based on handle, and you have access.
> Current design doesn't add any shared state for resources running on
> different process memory spaces, objects like QP, CQ, need user-space
> state shared to be really usable between processes ... hopefully some
> day we'll get there.

Thanks Alex.
Let me know if I'm missing anything, but assume I'm importing an MR. I realise
that the address and length fields aren't going to be valid, but the MR still
points to physical memory that probably isn't in my address space. So the
process can post operations on the MR, but can't access its data?

What is the implementation of the new callbacks going to look like? It sounds
like this feature doesn't involve the device at all; in that case I assume it
won't involve the providers either? Is it going to be a generic libibverbs
implementation?
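
For reference, the 'cmd_fd' hand-off in the quoted flow is left to "some Linux
system call". One common way to pass an open fd between processes is SCM_RIGHTS
over a Unix domain socket. Below is a minimal sketch of that mechanism only,
under stated assumptions: a pipe fd stands in for the real 'cmd_fd', no verbs
calls are made, and send_fd(), recv_fd() and demo_fd_handoff() are hypothetical
helper names, not part of the RFC or of libibverbs.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Properly aligned buffer for one SCM_RIGHTS control message carrying one fd. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send one open fd over a connected Unix socket. */
static int send_fd(int sock, int fd)
{
    char byte = 0; /* one data byte accompanies the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Demo: the parent plays the owning process, the child the importer. */
int demo_fd_handoff(void)
{
    int sv[2], pipefd[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(pipefd) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: drop inherited copies so only the received fd is used. */
        close(sv[0]);
        close(pipefd[0]);
        close(pipefd[1]);
        int fd = recv_fd(sv[1]);
        /* In the quoted flow, ibv_import_device(fd) would be called here. */
        _exit(fd >= 0 && write(fd, "ok", 2) == 2 ? 0 : 1);
    }

    /* Parent: hand over the write end of the pipe (stand-in for cmd_fd). */
    close(sv[1]);
    int rc = send_fd(sv[0], pipefd[1]);
    close(sv[0]);
    close(pipefd[1]);

    char buf[2] = { 0, 0 };
    int status = 0;
    if (rc == 0 && read(pipefd[0], buf, 2) != 2)
        rc = -1;
    waitpid(pid, &status, 0);
    close(pipefd[0]);

    if (rc != 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return memcmp(buf, "ok", 2) == 0 ? 0 : -1;
}
```

demo_fd_handoff() returns 0 when the child managed to write through the fd it
received. The received fd refers to the same open file description as the
sender's, so it fits the description above of both processes working on the
same kernel context.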