> -----Original Message-----
> From: Alex Margolin
> Sent: Thursday, January 25, 2018 2:43 PM
> To: 'Yuval Shaia' <yuval.shaia@xxxxxxxxxx>; Marcel Apfelbaum
> <marcel@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: RE: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> registration
>
> > -----Original Message-----
> > From: Yuval Shaia [mailto:yuval.shaia@xxxxxxxxxx]
> > Sent: Tuesday, January 23, 2018 10:30 PM
> > To: Alex Margolin <alexma@xxxxxxxxxxxx>; Marcel Apfelbaum
> > <marcel@xxxxxxxxxx>
> > Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > memory registration
> >
> > On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, January 11, 2018 6:45 PM
> > > > To: Yuval Shaia <yuval.shaia@xxxxxxxxxx>
> > > > Cc: Alex Margolin <alexma@xxxxxxxxxxxx>;
> > > > linux-rdma@xxxxxxxxxxxxxxx
> > > > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > > > memory registration
> > > >
> > > > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > > > +The following code example demonstrates non-contiguous memory
> > > > > > +registration, by combining two contiguous regions, along with the
> > > > > > +WR-based completion semantic:
> > > > > > +.PP
> > > > > > +.nf
> > > > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> > > > > > +if (!mr1) {
> > > > > > +    fprintf(stderr, "Failed to create MR #1\en");
> > > > > > +    return 1;
> > > > > > +}
> > > > > > +
> > > > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> > > > > > +if (!mr2) {
> > > > > > +    fprintf(stderr, "Failed to create MR #2\en");
> > > > > > +    return 1;
> > > > > > +}
> > > > >
> > > > > So, to register non-contiguous 512 random buffers i would have
> > > > > to create 512 MRs?
> > >
> > > I think typically if you have a large amount of buffers - it would be
> > > located in fairly close proximity, so you'd prefer one MR to cover all
> > > of them and the SGEs will only differ in base address.
> >
> > Define "large amount".
> > I did several experiments with something like hundred or few hundred
> > (Marcel, do you remember how many?) and they were scattered at the
> > range of about 3G so one MR is not an option. Our application is QEMU
> > so 3G for one MR means no memory overcommit.
> >
> > >
> > > Are you proposing the function also replaces ibv_reg_mr() if the user
> > > passes multiple unregistered regions?
> > > I could see the benefit, but then we'd require additional parameters
> > > (i.e. send_flags) and those MRs couldn't be reused (otherwise need to
> > > add output pointers for resulting MRs).
>
> Actually, I realized it can be implemented with the proposed API.
> All that is missing is a capability bit and a flag for set_layout_*, and
> the implementation could work as follows (changes relative to SG
> example):
>
> +assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);
> -mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> -if (!mr1) {
> -    fprintf(stderr, "Failed to create MR #1\en");
> -    return 1;
> -}
> -
> -mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> -if (!mr2) {
> -    fprintf(stderr, "Failed to create MR #2\en");
> -    return 1;
> -}
>
>  mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
>  if (!mr3) {
>      fprintf(stderr, "Failed to create result MR\en");
>      return 1;
>  }
>
>  struct ibv_sge composite[] =
>  {
>      {
>          .addr = addr1,
>          .length = len1,
> -        .lkey = mr1->lkey
>      },
>      {
>          .addr = addr2,
>          .length = len2,
> -        .lkey = mr2->lkey
>      }
>  };
>
> +ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS, 2,
> +composite);
> -ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
>  if (ret) {
>      fprintf(stderr, "Non-contiguous registration failed\en");
>      return 1;
>  }
>
> In this case calling ibv_mr_set_layout_sg() will cause an
> internal registration replacing the ibv_reg_mr calls for mr1 and mr2,
> and the registration will be stored in mr3.

Forgot to add: MR creation parameters, such as access flags, will be taken from the ibv_reg_mr() call that created mr3.

>
> Is this what you had in mind?
>
> >
> > Yeah, more or less the same ib_reg_mr but one that gets a list of pages
> > instead of a virtual address and will skip the "while (npages)" loop in
> > ib_umem_get and just go directly to dma_map_sg. The idea here is that
> > the HW supports a scattered list of buffers anyway, so why limit the
> > API to a contiguous virtual address range.
> >
> > We dropped this idea as it turns out that we need extra help from the
> > HW in the post_send phase, where the virtual address received in the SGE
> > refers to the virtual address given at ib_reg_mr.
> > We somehow believed that zero-based-mr would solve this by maybe
> > allowing addresses in the SGE to be something like an index to an entry
> > in the page-list given to ib_reg_mr, but apparently zero-based-mr is not
> > yet functional (at least not in CX3).
> > (We lack knowledge of what exactly zero-based-mr is.)
> >
> > > The benefit will probably not be latency, though, since IIRC the MR
> > > creation can't really be parallelized.
> > > Yuval - are you aware of a scenario implementing a high amount of
> > > ibv_reg_mr() calls?
> >
> > A high amount of ibv_reg_mr calls, no, but I have a scenario where my
> > application can potentially receive a request to create an MR for 262144
> > scattered pages.
> > By the way, using the suggested API from Jason below, the SG list will
> > still limit us; not sure how big an SG list can be, but surely not 262144.
> > So what we were thinking is to give ib_reg_mr a huge range, even 4G,
> > but then use a bitmap parameter that will specify only the pages in
> > that range that take part in the MR.
> >
> > >
> > > > That is a fair point - I wonder if some of these API should have
> > > > an option to accept a pointer directly?
> > > > Maybe the driver requires a MR but we don't need that as part of
> > > > the API?
> > > >
> > > > Particularly the _sg one..
> > > >
> > > > Jason
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> > > in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
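
For illustration, the bitmap idea mentioned in the thread (register one large range and mark which pages take part) can be related to the SG-list form by collapsing runs of set bits into address/length pairs. What follows is only a minimal sketch under stated assumptions: 4 KiB pages, a bitmap stored as 64-bit words with the least significant bit first, and a hypothetical `struct sg_run` and `bitmap_to_runs()` that are not part of rdma-core, the kernel, or the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one scatter/gather entry; NOT struct ibv_sge. */
struct sg_run {
    uint64_t addr;   /* start address of the run */
    uint64_t length; /* run length in bytes */
};

/* Test bit i of a bitmap stored as 64-bit words, least significant bit first. */
static int page_is_set(const uint64_t *bitmap, size_t i)
{
    return (int)((bitmap[i / 64] >> (i % 64)) & 1);
}

/*
 * Walk npages bits and merge adjacent set bits into runs.
 * At most max_runs entries are written to out. The return value is the
 * total number of runs found, which may exceed max_runs; the caller can
 * use that to detect that the output array was too small.
 */
static size_t bitmap_to_runs(uint64_t base, const uint64_t *bitmap,
                             size_t npages, struct sg_run *out,
                             size_t max_runs)
{
    size_t nruns = 0;
    size_t i = 0;

    assert(bitmap != NULL);
    while (i < npages) {
        size_t start;

        if (!page_is_set(bitmap, i)) {
            i++;
            continue;
        }
        /* Found the first page of a run; extend it while bits stay set. */
        start = i;
        while (i < npages && page_is_set(bitmap, i))
            i++;
        if (nruns < max_runs) {
            out[nruns].addr = base + ((uint64_t)start << SKETCH_PAGE_SHIFT);
            out[nruns].length = (uint64_t)(i - start) << SKETCH_PAGE_SHIFT;
        }
        nruns++;
    }
    return nruns;
}
```

Coalescing only helps when the selected pages are adjacent. If no two selected pages are neighbours, every page becomes its own run, so for a request on the order of the 262144 scattered pages mentioned in the thread the resulting list could still be far longer than an SG list is likely to allow.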