> -----Original Message-----
> From: Yuval Shaia [mailto:yuval.shaia@xxxxxxxxxx]
> Sent: Tuesday, January 23, 2018 10:30 PM
> To: Alex Margolin <alexma@xxxxxxxxxxxx>; Marcel Apfelbaum <marcel@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
>
> On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > > -----Original Message-----
> > > From: Jason Gunthorpe
> > > Sent: Thursday, January 11, 2018 6:45 PM
> > > To: Yuval Shaia <yuval.shaia@xxxxxxxxxx>
> > > Cc: Alex Margolin <alexma@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> > > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
> > >
> > > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > > +The following code example demonstrates non-contiguous memory
> > > > > +registration, by combining two contiguous regions, along with the
> > > > > +WR-based completion semantic:
> > > > > +.PP
> > > > > +.nf
> > > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> > > > > +if (!mr1) {
> > > > > +    fprintf(stderr, "Failed to create MR #1\en");
> > > > > +    return 1;
> > > > > +}
> > > > > +
> > > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> > > > > +if (!mr2) {
> > > > > +    fprintf(stderr, "Failed to create MR #2\en");
> > > > > +    return 1;
> > > > > +}
> > > >
> > > > So, to register 512 non-contiguous random buffers I would have to
> > > > create 512 MRs?
> > > >
> > I think that typically, if you have a large amount of buffers, they would be
> > located in fairly close proximity, so you'd prefer one MR to cover all of
> > them and the SGEs would only differ in base address.
>
> Define "large amount".
> I did several experiments with something like a hundred or a few hundred
> buffers (Marcel, do you remember how many?) and they were scattered over a
> range of about 3G, so one MR is not an option. Our application is QEMU, so
> 3G for one MR means no memory overcommit.
>
> > Are you proposing the function also replaces ibv_reg_mr() if the user
> > passes multiple unregistered regions?
> > I could see the benefit, but then we'd require additional parameters
> > (i.e. send_flags) and those MRs couldn't be reused (otherwise we'd need
> > to add output pointers for the resulting MRs).

Actually, I realized it can be implemented with the proposed API. All that is
missing is a capability bit and a flag for set_layout_*, and the
implementation could work as follows (changes relative to the SG example):

+assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);
+
-mr1 = ibv_reg_mr(pd, addr1, len1, 0);
-if (!mr1) {
-    fprintf(stderr, "Failed to create MR #1\en");
-    return 1;
-}
-
-mr2 = ibv_reg_mr(pd, addr2, len2, 0);
-if (!mr2) {
-    fprintf(stderr, "Failed to create MR #2\en");
-    return 1;
-}

mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
if (!mr3) {
    fprintf(stderr, "Failed to create result MR\en");
    return 1;
}

struct ibv_sge composite[] = {
    { .addr = addr1,
      .length = len1,
-     .lkey = mr1->lkey },
    { .addr = addr2,
      .length = len2,
-     .lkey = mr2->lkey }
};

+ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS, 2, composite);
-ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
if (ret) {
    fprintf(stderr, "Non-contiguous registration failed\en");
    return 1;
}

In this case, calling ibv_mr_set_layout_sg() will cause an internal
registration replacing the ibv_reg_mr() calls for mr1 and mr2, and the
registration will be stored in mr3. Is this what you had in mind?
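For clarity, the man-page example with those changes applied would read
roughly as below. This is only a sketch of the API proposed in this thread:
ibv_mr_set_layout_sg() and the IBV_MR_SET_LAYOUT_* names exist only in the
RFC, not in upstream libibverbs. The fragment follows the same conventions as
the example above (caps, pd, addr1/addr2, len1/len2 and the error paths are
assumed to exist in the surrounding code), addr1/addr2 are assumed to be
pointers (hence the casts, since ibv_sge.addr is a uint64_t), and the troff
\en escapes are written as plain \n:

/* Sketch of the proposed usage; the IBV_MR_SET_LAYOUT_* names are
 * hypothetical, taken from the diff above. */
assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);

/* Only the result MR is created explicitly; the individual buffers are
 * never registered by the application. */
mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
if (!mr3) {
    fprintf(stderr, "Failed to create result MR\n");
    return 1;
}

/* No per-entry lkey - there are no mr1/mr2 to take it from. */
struct ibv_sge composite[] = {
    { .addr = (uint64_t)(uintptr_t)addr1, .length = len1 },
    { .addr = (uint64_t)(uintptr_t)addr2, .length = len2 }
};

/* The proposed flag asks the provider to register the underlying buffers
 * internally as part of setting the layout. */
ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS,
                           2, composite);
if (ret) {
    fprintf(stderr, "Non-contiguous registration failed\n");
    return 1;
}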
>
> Yeah, more or less the same ib_reg_mr(), but one that gets a list of pages
> instead of a virtual address, and will skip the "while (npages)" loop in
> ib_umem_get() and just go directly to dma_map_sg(). The idea here is that
> the HW supports a scattered list of buffers anyway, so why limit the API to
> a contiguous virtual address.
>
> We dropped this idea as it turns out that we need extra help from the HW in
> the post_send phase, where the virtual address received in the SGE refers
> to the virtual address given at ib_reg_mr().
> We somehow believed that a zero-based MR would solve this, by maybe allowing
> addresses in the SGE to be something like an index into the page list given
> to ib_reg_mr(), but apparently zero-based MRs are not yet functional (at
> least not on CX3).
> (We lack knowledge of what exactly a zero-based MR is.)
>
> > The benefit will probably not be latency, though, since IIRC the MR
> > creation can't really be parallelized.
> > Yuval - are you aware of a scenario involving a high amount of
> > ibv_reg_mr() calls?
>
> A high amount of ibv_reg_mr() calls, no, but I have a scenario where my
> application can potentially receive a request to create an MR for 262144
> scattered pages.
> By the way, using the API Jason suggested below, the SG list will still
> limit us; not sure how big an SG list can be, but surely not 262144 entries.
> So what we were thinking is to give ib_reg_mr() a huge range, even 4G, but
> then use a bitmap parameter that specifies only the pages in that range
> that take part in the MR.
>
> > >
> > > That is a fair point - I wonder if some of these APIs should have an
> > > option to accept a pointer directly? Maybe the driver requires an MR
> > > but we don't need that in the API?
> > >
> > > Particularly the _sg one..
> > >
> > > Jason
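To make the bitmap idea above a bit more concrete, here is a minimal
user-space sketch. The helper is purely hypothetical (nothing like it exists
in verbs or in the RFC) and 4K pages are assumed; it just walks a bitmap
covering a large virtual range and collects the addresses of the pages that
would actually take part in the MR, i.e. the set a bitmap-based ib_reg_mr()
variant would have to pin and map:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT    12                       /* assumption: 4K pages */
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define BITS_PER_LONG (sizeof(unsigned long) * 8)

/* Hypothetical helper: collect the start address of every page marked in
 * 'bitmap' within [base, base + nr_pages * PAGE_SIZE).  Returns the number
 * of addresses written to 'out', which the caller sizes for the worst case. */
static size_t collect_mr_pages(uintptr_t base, const unsigned long *bitmap,
                               size_t nr_pages, uintptr_t *out)
{
        size_t n = 0;

        for (size_t i = 0; i < nr_pages; i++)
                if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
                        out[n++] = base + i * PAGE_SIZE;
        return n;
}

For scale: a 4G range of 4K pages is about one million pages, so the bitmap
itself is only around 128KB, compared with the 262144-entry SG list mentioned
above.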