RE: [RFC] Registering non-contiguous memory

Alex Margolin <alexma@xxxxxxxxxxxx> · Wed, 1 Nov 2017 11:11:42 +0000

We're planning to start the implementation soon - now is the last opportunity for input before we do.
Please let us know of any comments or suggestions you have.

-----Original Message-----
From: Alex Margolin 
Sent: Sunday, February 05, 2017 6:46 PM
To: linux-rdma@xxxxxxxxxxxxxxx
Subject: [RFC] Registering non-contiguous memory

Introduction
--------------------------------------------------------------------------------
Numerous applications communicate buffers with a non-contiguous memory layout.
For example, HPC applications often work on a matrix, and require sending a row or a column. Sending non-contiguous data requires specifying all the buffers being transferred, along with potentially the same number of memory keys (should the buffers be registered under different memory regions). Extended memory windows are proposed to address complex memory layouts, and allow the user to send and receive them in an efficient manner.
This RFC proposes an extension for memory windows for the use of non-contiguous buffers.

The problem
--------------------------------------------------------------------------------
Currently, this is implemented using an "sge_list" with an entry for each buffer. Sending a regular structure this way may require a multitude of entries, even though the structure can easily be represented in a more compact way. For example, a matrix column:
               M
          -----------
         | |X| | | | |
          -----------
         | |X| | | | |
          -----------   N
         | |X| | | | |
          -----------
         | |X| | | | |
          -----------

In the current implementation, sending such a column would require N scatter-gather entries, each with the same length and at an equal distance from the previous one.
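
To make the cost concrete, here is a rough sketch (not part of the patch) of what an application has to build today for one column. The struct sge below is a local stand-in with the same three fields as struct ibv_sge, so that the snippet compiles without libibverbs; build_column_sges and its parameters are hypothetical names, and a row-major matrix is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in mirroring the addr/length/lkey fields of struct ibv_sge. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Fill one scatter-gather entry per row for column `col` of a row-major
 * matrix with n_rows rows and n_cols columns of elem_size-byte cells.
 * `sges` must have room for n_rows entries. Returns the number of
 * entries written, which is always n_rows.
 */
static size_t build_column_sges(struct sge *sges, uint64_t base, uint32_t lkey,
				size_t n_rows, size_t n_cols, size_t col,
				uint32_t elem_size)
{
	assert(sges != NULL && col < n_cols);
	for (size_t row = 0; row < n_rows; row++) {
		/* Consecutive entries differ only by a constant stride. */
		sges[row].addr = base + (uint64_t)(row * n_cols + col) * elem_size;
		sges[row].length = elem_size;
		sges[row].lkey = lkey;
	}
	return n_rows;
}
```

Every entry carries the same length and key, and the addresses differ by a constant n_cols * elem_size, which is exactly the redundancy this RFC aims to remove.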

A similar case is data spanning across multiple memory regions:



                     ----------
                    |          |
                    |          |
                    | Memory   |
                   /| region #1|
     "Composite   / |          |
       region"   /  |          |
     ---------- /  -|          |
    |          |  / |          |
    |          | /   ----------
     ---------- <
    |          | \   ---------- 
    |          |  --|          |
    |          |    | Memory   |
    |          | ___| region #2|
     ---------- <   |          |
    |          | \   ----------
     ----------\  \_ ----------
                \   |          |
                 \  | Memory   |
                  \-| region #N|
                    |          |
                    |          |
                     ----------

Currently, sending such a "composite region" involves specifying all N regions it spans across. 
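
As an illustration only, the following sketch shows what "a composite region" means in terms of address translation: a byte offset into the composite is resolved to one of the underlying pieces. struct piece and composite_resolve are hypothetical names invented for this example; they are not part of verbs or of the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: a byte range inside some registered region. */
struct piece {
	uint64_t addr;
	uint64_t length;
	uint32_t lkey;
};

/*
 * Map a byte offset within the composite (the pieces laid end to end) to
 * the piece that holds it. Returns the piece index and stores the real
 * address in *addr, or returns -1 if the offset is past the end.
 */
static int composite_resolve(const struct piece *pieces, size_t n,
			     uint64_t offset, uint64_t *addr)
{
	assert(pieces != NULL && addr != NULL);
	for (size_t i = 0; i < n; i++) {
		if (offset < pieces[i].length) {
			*addr = pieces[i].addr + offset;
			return (int)i;
		}
		offset -= pieces[i].length;
	}
	return -1;
}
```

Today the sender has to carry all N pieces in each work request; the idea is to register this list once and refer to it by a single key.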

The problem is that a large amount of information needs to be passed with the WR to communicate a potentially small amount of data, causing significant overhead.
This is especially true if the layout information does not fit in a single WR (inline) and extra overhead is incurred. Also, applications tend to re-use the same layout over a multitude of network operations, thus aggravating this problem.

Potentially, one may want to use more than one dimension of stride, like the case of a side of a multi-dimensional cube (M * N * O):

                              O
                          ___ ___ ___
                        /_X_/_X_/_X_/|
                    N  /_X_/_X_/_X_/||
                      /_X_/_X_/_X_/|/|
                     |   |   |   | /||
                     |___|___|___|/|/|
                  M  |   |   |   | /||
                     |___|___|___|/|/
                     |   |   |   | /
                     |___|___|___|/

Suggested Solution
--------------------------------------------------------------------------------
The suggested solution is to extend the existing API of memory windows to allow the user to describe a memory layout: a compact description of patterns of memory - which can later be used to transfer buffers according to them.

For the aforementioned matrix example - the layout will include the size of a single cell in the matrix, the distance between cells (M times that size) and the total number of buffers (N). For the composite example, the layout will simply contain the list of memory regions along with a base-pointer and length.
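
A minimal sketch of the matrix case, assuming a row-major matrix: the whole column is captured by a base address plus the three values named above, regardless of N. struct strided_layout, column_layout and strided_addr are hypothetical names used only for illustration and do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compact description of equally spaced, equally sized buffers. */
struct strided_layout {
	uint64_t base;   /* address of the first buffer */
	uint64_t length; /* size of each buffer, in bytes */
	uint64_t stride; /* distance between buffer starts, in bytes */
	uint64_t count;  /* number of buffers */
};

/* Describe column `col` of a row-major matrix of n_rows x n_cols cells. */
static struct strided_layout column_layout(uint64_t matrix_base, uint64_t n_rows,
					   uint64_t n_cols, uint64_t col,
					   uint64_t elem_size)
{
	assert(col < n_cols);
	struct strided_layout l = {
		.base = matrix_base + col * elem_size,
		.length = elem_size,
		.stride = n_cols * elem_size,
		.count = n_rows,
	};
	return l;
}

/* Address of the i-th buffer covered by the layout. */
static uint64_t strided_addr(const struct strided_layout *l, uint64_t i)
{
	assert(l != 0 && i < l->count);
	return l->base + i * l->stride;
}
```

The description stays the same size whether the column has four cells or four million; only the count field changes.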

Suggested API Details
--------------------------------------------------------------------------------
The main addition is the layout structure, composed of an array of entries.
Each entry refers to a previously created memory region or memory window, which allows the formation of multi-level memory windows for complex structures.
Each entry can be a contiguous area, creating a composite, or a quantity of equally-distanced buffers (but not a mixture of the two). The dimension of the strides is accommodated by a variable-length array of item counts and stride interval sizes. An additional per-entry field enables non-uniform interleaving, where the ratio is not 1:1, so we may take two items from the first entry for each item of the second entry.
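
To illustrate how an array of (stride, count) pairs can describe more than one dimension, here is a hedged sketch that maps the index of an item to its byte offset. The field names follow the stride/count pair in the patch below, but struct stride_dim, the two helper functions, and the traversal order (dimension 0 varies fastest) are assumptions made for this example; the RFC does not specify the order.

```c
#include <assert.h>
#include <stdint.h>

/* One stride dimension: `count` items spaced `stride` bytes apart. */
struct stride_dim {
	uint64_t stride;
	uint64_t count;
};

/* Total number of items described by all dimensions together. */
static uint64_t strided_item_count(const struct stride_dim *dims,
				   uint32_t dim_count)
{
	uint64_t total = 1;
	for (uint32_t d = 0; d < dim_count; d++)
		total *= dims[d].count;
	return total;
}

/*
 * Byte offset (relative to the first item) of item number `index`.
 * Assumption: dimension 0 varies fastest, like the innermost loop.
 */
static uint64_t strided_offset(const struct stride_dim *dims,
			       uint32_t dim_count, uint64_t index)
{
	uint64_t offset = 0;
	for (uint32_t d = 0; d < dim_count; d++) {
		assert(dims[d].count > 0);
		offset += (index % dims[d].count) * dims[d].stride;
		index /= dims[d].count;
	}
	return offset;
}
```

For instance, in a row-major M * N * O cube of 8-byte cells with M = 3, N = 4 and O = 5, the face with the last index held fixed has 12 items and is described by two dimensions: { stride = O * 8, count = N } and { stride = N * O * 8, count = M }.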

The layout is "switched on" as a bind parameter with an additional access flag, which is somewhat of a misuse, but since struct ibv_mw_bind_info is not easily expandable in the functions using it, the alternative is an additional verb.

We do propose adding a verb for the allocation of such memory windows, which is required since the original memory window allocation function is not easily extendable, and we require an additional parameter to determine the amount of resources to allocate towards it (descriptors). In addition, this number is available for existing windows, in order to determine how to set this for a composite of two existing windows, or to keep track of resource use for it.

Finally, we add the option to obtain a local memory key for memory windows, aside from the currently available remote key. This can be specified in an SGE to send data out of a memory layout. This key will only be valid if local access is requested in the access flags.



Signed-off-by: Alex Margolin <alexma@xxxxxxxxxxxx>
---

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index ea3dade..9ef3639 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -96,6 +96,20 @@ struct verbs_qp {
 	uint32_t		comp_mask;
 	struct verbs_xrcd       *xrcd;
 };
+
+enum verbs_mw_mask {
+	VERBS_MW_LKEY		= 1 << 0,
+	VERBS_MW_DESCRIPTOR_NUM	= 1 << 1,
+	VERBS_MW_RESERVED	= 1 << 2
+};
+
+struct verbs_mw {
+	struct ibv_mw		mw;
+	uint32_t		comp_mask;
+	uint32_t		lkey;
+	uint32_t		descriptor_num;
+};
+
 typedef struct ibv_device *(*ibv_driver_init_func)(const char *uverbs_sys_path,
 						   int abi_version);
 typedef struct verbs_device *(*verbs_driver_init_func)(const char *uverbs_sys_path,
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index e994c21..748dc60 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -214,6 +214,23 @@ struct ibv_tso_caps {
 	uint32_t supported_qpts;
 };
 
+struct ibv_ncm_caps {
+	uint64_t general_caps;
+	uint32_t max_mkey_ims_list_size;
+	uint32_t max_send_wqe_inline_cms;
+	uint32_t max_send_wqe_inline_ims;
+	uint32_t max_mw_recursion_depth;
+	uint32_t max_mw_stride_dimension;
+};
+
+enum ibv_ncm_general_caps {
+	IBV_NCM_SUPPORT_COMPOSITE = 1 << 0,
+	IBV_NCM_SUPPORT_INTERLEAVED = 1 << 1,
+	IBV_NCM_SUPPORT_INTERLEAVED_REPETITION = 1 << 2,
+	IBV_NCM_SUPPORT_INTERLEAVED_NONUNIFORM_REPETITION = 1 << 3,
+	IBV_NCM_SUPPORT_INTERLEAVED_NONUNIFORM_TOTAL_ITEMS = 1 << 4,
+};
+
 /* RX Hash function flags */
 enum ibv_rx_hash_function_flags {
 	IBV_RX_HASH_FUNC_TOEPLITZ	= 1 << 0,
@@ -256,6 +273,7 @@ struct ibv_device_attr_ex {
 	struct ibv_tso_caps	tso_caps;
 	struct ibv_rss_caps     rss_caps;
 	uint32_t		max_wq_type_rq;
+	struct ibv_ncm_caps     ncm_caps;
 };
 
 enum ibv_mtu {
@@ -472,13 +490,7 @@ enum ibv_access_flags {
 	IBV_ACCESS_MW_BIND		= (1<<4),
 	IBV_ACCESS_ZERO_BASED		= (1<<5),
 	IBV_ACCESS_ON_DEMAND		= (1<<6),
-};
-
-struct ibv_mw_bind_info {
-	struct ibv_mr	*mr;
-	uint64_t	 addr;
-	uint64_t	 length;
-	int		 mw_access_flags; /* use ibv_access_flags */
+	IBV_ACCESS_BIND_MW_NONCONTIG	= (1<<7),
 };
 
 struct ibv_pd {
@@ -533,6 +545,61 @@ struct ibv_mw {
 	enum ibv_mw_type	type;
 };
 
+struct ibv_mw_alloc_attr_ex {
+	uint32_t comp_mask;
+	struct ibv_pd *pd;
+	enum ibv_mw_type type;
+	int max_descriptors;
+};
+
+enum ibv_mw_entry_type {
+	IBV_MW_ENTRY_USE_MR = 0,
+	IBV_MW_ENTRY_USE_MW
+};
+
+struct ibv_mw_bind_layout_entry {
+	uint32_t comp_mask;
+	enum ibv_mw_entry_type mem_obj_type;
+	union {
+		struct ibv_mr *mr;
+		struct ibv_mw *mw;
+	} mem_obj;
+	uint64_t addr;
+	uint64_t length;
+	struct {
+		uint64_t repeat_count;
+		uint32_t dimension_count;
+		struct {
+			uint64_t stride;
+			uint64_t count;
+		} *dimension;
+	} interleaved;
+};
+
+enum ibv_mw_bind_layout_type {
+	IBV_MW_BIND_LAYOUT_TYPE_COMPOSITE = 0,
+	IBV_MW_BIND_LAYOUT_TYPE_INTERLEAVED
+};
+
+struct ibv_mw_bind_layout {
+	uint32_t comp_mask;
+	enum ibv_mw_bind_layout_type type;
+	uint32_t entry_count;
+	struct ibv_mw_bind_layout_entry *entries;
+};
+
+struct ibv_mw_bind_info {
+	union {
+		struct {
+			struct ibv_mr   *mr;
+			uint64_t        addr;
+			uint64_t        length;
+		};
+		struct ibv_mw_bind_layout *layout;
+	};
+	int		 mw_access_flags; /* use ibv_access_flags */
+};
+
 struct ibv_global_route {
 	union ibv_gid		dgid;
 	uint32_t		flow_label;
@@ -1411,11 +1478,13 @@ enum verbs_context_mask {
 	VERBS_CONTEXT_QP	= 1 << 2,
 	VERBS_CONTEXT_CREATE_FLOW = 1 << 3,
 	VERBS_CONTEXT_DESTROY_FLOW = 1 << 4,
-	VERBS_CONTEXT_RESERVED	= 1 << 5
+	VERBS_CONTEXT_ALLOC_MW	= 1 << 5,
+	VERBS_CONTEXT_RESERVED	= 1 << 6
 };
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	struct ibv_mw *(*alloc_mw_ex)(struct ibv_mw_alloc_attr_ex *mw_alloc_attr);
 	int (*destroy_rwq_ind_table)(struct ibv_rwq_ind_table *rwq_ind_table);
 	struct ibv_rwq_ind_table *(*create_rwq_ind_table)(struct ibv_context *context,
 							  struct ibv_rwq_ind_table_init_attr *init_attr);