RE: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration

> -----Original Message-----
> From: Yuval Shaia [mailto:yuval.shaia@xxxxxxxxxx]
> Sent: Tuesday, January 23, 2018 10:30 PM
> To: Alex Margolin <alexma@xxxxxxxxxxxx>; Marcel Apfelbaum
> <marcel@xxxxxxxxxx>
> Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> registration
> 
> On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > > -----Original Message-----
> > > From: Jason Gunthorpe
> > > Sent: Thursday, January 11, 2018 6:45 PM
> > > To: Yuval Shaia <yuval.shaia@xxxxxxxxxx>
> > > Cc: Alex Margolin <alexma@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> > > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > > memory registration
> > >
> > > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > > +The following code example demonstrates non-contiguous memory
> > > > > +registration by combining two contiguous regions, along with the
> > > > > +WR-based completion semantic:
> > > > > +.PP
> > > > > +.nf
> > > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> > > > > +if (!mr1) {
> > > > > +        fprintf(stderr, "Failed to create MR #1\en");
> > > > > +        return 1;
> > > > > +}
> > > > > +
> > > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> > > > > +if (!mr2) {
> > > > > +        fprintf(stderr, "Failed to create MR #2\en");
> > > > > +        return 1;
> > > > > +}
> > > >
> > > > So, to register 512 non-contiguous random buffers I would have to
> > > > create 512 MRs?
> >
> >
> > I think typically, if you have a large number of buffers, they would be
> > located in fairly close proximity, so you'd prefer one MR to cover all
> > of them, and the SGEs would only differ in base address.
> 
> Define "large number".
> I did several experiments with something like a hundred or a few hundred
> (Marcel, do you remember how many?) and they were scattered over a range
> of about 3G, so one MR is not an option. Our application is QEMU, so 3G
> for one MR means no memory overcommit.
> 
> >
> > Are you proposing the function also replaces ibv_reg_mr() if the user
> > passes multiple unregistered regions?
> > I could see the benefit, but then we'd require additional parameters
> > (e.g. send_flags), and those MRs couldn't be reused (otherwise we'd
> > need to add output pointers for the resulting MRs).
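
To illustrate the single-MR case from my paragraph above - a minimal
sketch, where pd, buf, region_len and the two offset/length pairs are
placeholders:

struct ibv_mr *mr = ibv_reg_mr(pd, buf, region_len, IBV_ACCESS_LOCAL_WRITE);
if (!mr) {
        fprintf(stderr, "Failed to create MR\n");
        return 1;
}

/* Both SGEs share the single lkey; only the base address differs. */
struct ibv_sge sges[2] = {
        { .addr = (uintptr_t)buf + off1, .length = len1, .lkey = mr->lkey },
        { .addr = (uintptr_t)buf + off2, .length = len2, .lkey = mr->lkey },
};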

Actually, I realized it can be implemented with the proposed API.
All that is missing is a capability bit and a flag for set_layout_*,
and the implementation could work as follows (changes relative to the SG
example):

+assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);
-mr1 = ibv_reg_mr(pd, addr1, len1, 0);
-if (!mr1) {
-        fprintf(stderr, "Failed to create MR #1\en");
-        return 1;
-}
-
-mr2 = ibv_reg_mr(pd, addr2, len2, 0);
-if (!mr2) {
-        fprintf(stderr, "Failed to create MR #2\en");
-        return 1;
-}

mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
if (!mr3) {
        fprintf(stderr, "Failed to create result MR\en");
        return 1;
}

struct ibv_sge composite[] =
{
        {
                .addr = addr1,
                .length = len1,
-                .lkey = mr1->lkey
        },
        {
                .addr = addr2,
                .length = len2,
-                .lkey = mr2->lkey
        }
};

+ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS, 2, composite);
-ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
if (ret) {
        fprintf(stderr, "Non-contiguous registration failed\en");
        return 1;
}

In this case, calling ibv_mr_set_layout_sg() will perform an internal
registration, replacing the separate ibv_reg_mr() calls for mr1 and mr2,
with the resulting registration stored in mr3.
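
A send through the composite MR could then look like this - a sketch,
assuming a connected QP in qp, with error handling abbreviated; since mr3
is zero-based, .addr is an offset into the combined layout rather than a VA:

struct ibv_sge sge = {
        .addr   = 0,              /* offset 0 into the composite layout */
        .length = len1 + len2,    /* spans both underlying buffers */
        .lkey   = mr3->lkey,
};

struct ibv_send_wr wr = {
        .sg_list = &sge,
        .num_sge = 1,
        .opcode  = IBV_WR_SEND,
}, *bad_wr;

if (ibv_post_send(qp, &wr, &bad_wr)) {
        fprintf(stderr, "Post send failed\n");
        return 1;
}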

Is this what you had in mind?

> 
> Yeah, more or less the same as ib_reg_mr, but one that gets a list of
> pages instead of a virtual address, skips the "while (npages)" loop in
> ib_umem_get, and goes directly to dma_map_sg. The idea is that the HW
> already supports a scattered list of buffers, so why limit the API to a
> contiguous virtual address range?
> 
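If it helps, the kernel-side variant described above might have roughly
this shape (purely a hypothetical sketch, not an existing API):

/*
 * Hypothetical: register an MR straight from a page array, skipping the
 * per-VA pinning loop in ib_umem_get() and going directly to dma_map_sg().
 */
struct ib_mr *ib_reg_mr_pages(struct ib_pd *pd, struct page **pages,
                              size_t npages, int access_flags);
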
> We dropped this idea as it turned out that we need extra help from the
> HW in the post_send phase, where the virtual address received in the SGE
> refers to the virtual address given at ib_reg_mr time.
> We believed that a zero-based MR would solve this, perhaps by allowing
> addresses in the SGE to be something like an index into the page list
> given to ib_reg_mr, but apparently zero-based MRs are not yet functional
> (at least not on CX3).
> (We lack knowledge of what exactly a zero-based MR is.)
> 
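For what it's worth, my understanding is that with a zero-based MR the
addresses in the SGE simply become offsets from the start of the MR rather
than process VAs - a minimal sketch, where base/len are placeholders for a
registered buffer:

struct ibv_mr *zmr = ibv_reg_mr(pd, base, len,
                                IBV_ACCESS_ZERO_BASED | IBV_ACCESS_LOCAL_WRITE);

struct ibv_sge sge = {
        .addr   = 42,        /* 42 bytes into the MR, not a virtual address */
        .length = 100,
        .lkey   = zmr->lkey,
};
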
> > The benefit will probably not be latency, though, since IIRC MR
> > creation can't really be parallelized.
> > Yuval - are you aware of a scenario involving a high number of
> > ibv_reg_mr() calls?
> 
> A high number of ibv_reg_mr calls, no, but I have a scenario where my
> application can potentially receive a request to create an MR for 262144
> scattered pages.
> By the way, using the API Jason suggested below, the SG list still
> limits us; I'm not sure how big an SG list can be, but surely not 262144
> entries.
> So what we were thinking is to give ib_reg_mr a huge range, even 4G, but
> then use a bitmap parameter that specifies only the pages in that range
> that take part in the MR.
> 
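A bitmap-style variant of that might look roughly like this (hypothetical
sketch, all names invented):

/*
 * Hypothetical: cover [base, base + range), but only the pages whose bit
 * is set in page_bitmap actually take part in the MR.
 */
struct ibv_mr *ibv_reg_mr_bitmap(struct ibv_pd *pd, void *base, size_t range,
                                 const unsigned long *page_bitmap, int access);
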
> >
> > >
> > > That is a fair point - I wonder if some of these APIs should have an
> > > option to accept a pointer directly? Maybe the driver requires an MR,
> > > but we don't need one at the API level?
> > >
> > > Particularly the _sg one..
> > >
> > > Jason
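
On that last point, a pointer-based variant of the _sg call might look
like this (again, a hypothetical sketch):

/* Hypothetical: like ibv_mr_set_layout_sg(), but takes raw buffers, so
 * no per-buffer MR (and no lkey) is needed. */
struct ibv_buf {
        void   *addr;
        size_t  length;
};

int ibv_mr_set_layout_buf(struct ibv_mr *mr, uint32_t flags,
                          int num_buf, const struct ibv_buf *buf_list);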