Re: Kernel fast memory registration API proposal [RFC]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jul 15, 2015 at 12:32:33AM -0700, Christoph Hellwig wrote:
> int rdma_create_mr(struct ib_pd *pd, enum rdma_mr_type mr,
> 	u32 max_pages, int flags);
> 
> >   *     array from a SG list
> >   * @mr:          memory region
> >   * @sg:          sg list
> >   * @sg_nents:    number of elements in the sg
> >   *
> >   * Can fail if the HW is not able to register this
> >   * sg list. In case of failure - caller is responsible
> >   * to handle it (bounce-buffer, multiple registrations...)
> >   */
> > int ib_mr_set_sg(struct ib_mr *mr,
> >                  struct scatterlist *sg,
> >                  unsigned short sg_nents);
> 
> Call this rdma_map_sg?
> 
> >          /* register the MR */
> >          frwr.opcode = IB_WR_FAST_REG_MR;
> >          frwr.wrid = my_wrid;
> >          frwr.wr.fast_reg.mr = mr;
> >          frwr.wr.fast_reg.iova = ib_sg_dma_adress(&sg[0]);
> >          frwr.wr.fast_reg.length = length;
> >          frwr.wr.fast_reg.access_flags = my_flags;
> 
> Provide a helper to hide all this behind the scenes please:
> 
> void rdma_init_mr_wr(struct ib_send_wr *wr, struct rdma_mr *mr,
> 		u64 wr_id, int mr_access_flags);
>
> Or if we got with Jason's suggestion split "int mr_access_flags" into
> "bool remote, bool is_write".

Yes please. Considering the security implications we need to be much
more careful API wise here. This is more of a code-as-documentation
issue than a functional issue.

Lets avoid the issue of implicit posting, but still delegate the WR
construction to the driver:

 rdma_map_sg_lkey(u32 *lkey_out,
                  struct rdma_mr *mr,
		  const struct scatterlist *sg,
 	          unsigned short sg_nents,
		  unsigned int ops_supported,
		  struct ib_send_wr *post_wr,
		  u64 wrid)
 rdma_map_sg_rkey(.. same args ..)

Used as:

 rdma_map_sg_lkey(&wr[1].lkey,mr,sg,sg_nents,RDMA_OP_SEND,
 		 &wr[0],...);
 wr[1].opcode = IB_SEND;
 ib_post_send(wr..)

I'd probably implement the above as two wrappers around a
generic driver/core callback. All the wrappers do is translation of
ops_supports to the IB_ACCESS scheme.

The two entry points interpret their ops_supported differently
(source/sink in Steve's model), so it becomes impossible to
inadvertantly create a remote access lkey. And the API makes it
unnatural to use a MR as both a lkey and rkey.

Just thinking .. I'd probably drop the wrid and have rdma_map_sg_x
create an unsignaled wr by default. If the caller wants to force
signaling they can fill in the write and change the flags.

---------

I'm still unhappy with this, from a broad perspective. Look at what
any ULP has to implement to do a RDMA READ properly:

if (iwarp)
{
    rdma_map_sg_lkey(&wr[1].lkey,mr,sg,sg_nents,RDMA_OP_RDMA_READ,
 		 &wr[0],...);
    wr[1].opcode = IB_RDMA_READ;
    wr[2].opcode = IB_WR_LOCAL_INV;
    wr[2].invalidate_rkey = wr[1].lkey;
}
else if (sg_nents <= device->read_sg_limit)
{
    wr[0].opcode = IB_RDMA_READ;
    wr[0].lkey = pd->local_dma_lkey;
    ... translate struct sctatter_list to a wr ...
}
else if (fmr)
{
     ... ? ...
}
} else { // For IB
    rdma_map_sg_lkey(&wr[1].lkey,mr,sg,sg_nents,RDMA_OP_RDMA_READ,
 		 &wr[0],...);
    wr[1].opcode = IB_RDMA_READ;
    // We can lazy invalidate when we recycle the MR (?)
}

And that isn't even considering the possibility of using multiple
RDMA_READ ops instead of a MR. (which would be smarter that FMR)

I see the above as a common, mandatory, pattern that should be
factored.. If a ULP is doing RDMA_READ and it isn't doing the above,
it is broken. It either doesn't support iWarp, or it is throwing IB
performance away.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux