[RFC 0/5] Hardware tag matching

Introduction
============

The MPI standard defines a set of rules, known as tag-matching, for matching
source send operations to destination receives according to the following
attributes:

* Communicator
* User tag - wild card may be specified by the receiver
* Source rank - wild card may be specified by the receiver
* Destination rank - wild card may be specified by the receiver

These matching attributes are specified by all Send and Receive operations.
Send operations from a given source to a given destination are processed in
the order in which the Sends were posted. Receive operations are associated
with the earliest send operation (from any source) that matches the
attributes, in the order in which the Receives were posted. Note that Receive
tags are not necessarily consumed in the order they are created, e.g., a later
generated tag may be consumed if earlier tags do not satisfy the matching
rules.
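
As an illustration of the rules above, the following is a minimal software
sketch of receive-side matching. It is not taken from any MPI implementation;
the names (posted_recv, match_message) and the ANY_SOURCE/ANY_TAG wildcard
values are made up for this example, and the destination rank is left implicit
(it is the receiving process). The list is scanned in posting order, so a
later receive is consumed only when no earlier one matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ANY_SOURCE (-1)	/* hypothetical wildcard values for this sketch */
#define ANY_TAG    (-1)

struct posted_recv {
	int  comm;	/* communicator id */
	int  source;	/* source rank, or ANY_SOURCE */
	int  tag;	/* user tag, or ANY_TAG */
	bool consumed;	/* already matched to a message */
};

struct message {
	int comm;
	int source;
	int tag;
};

static bool recv_matches(const struct posted_recv *r, const struct message *m)
{
	return !r->consumed &&
	       r->comm == m->comm &&
	       (r->source == ANY_SOURCE || r->source == m->source) &&
	       (r->tag == ANY_TAG || r->tag == m->tag);
}

/* Return the index of the earliest posted receive that matches m and mark it
 * consumed, or return -1 if the message is unexpected. */
static int match_message(struct posted_recv *list, size_t n,
			 const struct message *m)
{
	assert(list != NULL && m != NULL);
	for (size_t i = 0; i < n; i++) {
		if (recv_matches(&list[i], m)) {
			list[i].consumed = true;
			return (int)i;
		}
	}
	return -1;
}
```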

When a message arrives at the receiver, MPI implementations often classify it
as either 'expected' or 'unexpected' according to whether a Receive operation
with a matching tag has already been posted by the application. In the
expected case, the message may be processed immediately. In the unexpected
case, the message is saved in an unexpected message queue, and will be
processed when a matching Receive operation is posted.
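
The expected/unexpected split can be sketched with two queues. This is only a
toy model under simplifying assumptions: matching is reduced to exact tag
equality (no communicator, rank or wild cards), the queues have an arbitrary
fixed capacity, and all names (endpoint, on_message, on_post_recv) are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 16	/* arbitrary bound for the sketch */

struct tag_queue {
	uint64_t tags[QUEUE_CAP];
	size_t	 count;
};

/* Remove the oldest entry equal to tag; returns 1 if one was found */
static int queue_take(struct tag_queue *q, uint64_t tag)
{
	for (size_t i = 0; i < q->count; i++) {
		if (q->tags[i] == tag) {
			for (size_t j = i + 1; j < q->count; j++)
				q->tags[j - 1] = q->tags[j];
			q->count--;
			return 1;
		}
	}
	return 0;
}

static void queue_push(struct tag_queue *q, uint64_t tag)
{
	assert(q->count < QUEUE_CAP);
	q->tags[q->count++] = tag;
}

struct endpoint {
	struct tag_queue posted;	/* receives posted, not yet matched */
	struct tag_queue unexpected;	/* messages with no receive yet */
};

/* Message arrival: returns 1 if expected (processed now), 0 if queued */
static int on_message(struct endpoint *ep, uint64_t tag)
{
	if (queue_take(&ep->posted, tag))
		return 1;
	queue_push(&ep->unexpected, tag);
	return 0;
}

/* Receive posting: returns 1 if satisfied by a queued unexpected message */
static int on_post_recv(struct endpoint *ep, uint64_t tag)
{
	if (queue_take(&ep->unexpected, tag))
		return 1;
	queue_push(&ep->posted, tag);
	return 0;
}
```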

To bound the amount of memory needed to hold unexpected messages, MPI
implementations use 2 data transfer protocols. The 'eager' protocol is used
for small messages. Eager messages are sent without any prior synchronization
and are processed or buffered at the receiver. Typically, with RDMA, a single
RDMA-Send operation is used to transfer the data.

The 'rendezvous' protocol is used for large messages. Initially, only the
message tag is sent along with some meta-data. The receiver initiates the
corresponding data transfer only after the tag has been matched to a Receive
operation. A common RDMA implementation is to send the message tag with an
RDMA-Send, and transfer the data with an RDMA-Read issued by the receiver.
When the transfer is complete, the receiver uses an RDMA-Send to notify the
sender that its buffer may be freed.
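
A sender-side protocol choice might look like the sketch below. The size
threshold (eager_limit) is an assumed tunable of the MPI implementation, not
something defined by this RFC, and the helper names are hypothetical. The
second helper shows why the split bounds receiver memory: under these
assumptions an unexpected message never occupies more than the larger of the
eager limit and the rendezvous header size.

```c
#include <assert.h>
#include <stddef.h>

enum tm_protocol { TM_PROTO_EAGER, TM_PROTO_RNDV };

/* eager_limit is an implementation-chosen threshold (assumption) */
static enum tm_protocol choose_protocol(size_t msg_len, size_t eager_limit)
{
	assert(eager_limit > 0);
	return msg_len <= eager_limit ? TM_PROTO_EAGER : TM_PROTO_RNDV;
}

/* Upper bound on buffer space for n_msgs unexpected messages: an eager
 * message is at most eager_limit bytes, a rendezvous request at most
 * rndv_hdr_size bytes. */
static size_t unexpected_bytes_bound(size_t n_msgs, size_t eager_limit,
				     size_t rndv_hdr_size)
{
	size_t per_msg = eager_limit > rndv_hdr_size ? eager_limit
						      : rndv_hdr_size;

	return n_msgs * per_msg;
}
```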

RDMA tag-matching offload
=========================

Tag-matching offload satisfies the following principles:
-   Tag-matching is viewed as an RDMA application, and thus does not affect the
    RDMA transport in any way (*)
-   Tag-matching processing will be split between HW and SW.
    *   HW will hold a bounded prefix of Receive tags
-   HW will process and transfer any expected message that matches a tag held
    in HW.
    *   In case the message uses the rendezvous protocol, HW will also initiate
	the RDMA-Read data transfer and send a notification message when the
	data transfer completes.
-   SW will handle any message that is either unexpected or whose tag is not
    held in HW.

(*) This concept can apply to additional application-specific offloads in the
future.

Tag-matching is initially defined for RC transport. Tag-matching messages are
encapsulated in RDMA-Send messages and contain the following headers:

    0			1		    2			3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   Tag Matching Header (TMH):
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |	Operation  |		      reserved			   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |			  User data (optional)			   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |				 Tag				   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |				 Tag				   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Rendezvous Header (RVH):
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |			     Virtual Address			   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |			     Virtual Address			   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |			       Remote Key			   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |				 Length				   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Tag-matching messages always contain a TMH. An RVH is added for Rendezvous
request messages. The following message formats are defined:
-   Eager request: TMH | payload
-   Rendezvous request: TMH | RVH | optional meta-data (**)
-   Rendezvous response: TMH
Note that rendezvous data transfers are standard RDMA-Reads.

(**) Rendezvous request messages may also arrive unexpected; in this case, the
message is handled in SW, optionally leveraging additional meta-data passed by
the sender.
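
To make the layout concrete, here is a rough sketch that serializes a
rendezvous request (TMH | RVH | meta-data) into a byte buffer. It assumes the
multi-byte fields are big-endian on the wire and that both headers are 16
bytes, as the diagrams suggest. The numeric opcode values and the helper names
are placeholders for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TMH_SIZE 16
#define RVH_SIZE 16

/* Placeholder opcode values; this sketch does not know the real ones */
enum { OP_NO_TAG = 0, OP_EAGER = 1, OP_RNDV = 2, OP_FIN = 3 };

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

static void put_be64(uint8_t *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

/* TMH: opcode(1) | reserved(3) | user data(4) | tag(8) */
static size_t encode_tmh(uint8_t *buf, uint8_t opcode, uint32_t app_ctx,
			 uint64_t tag)
{
	buf[0] = opcode;
	memset(buf + 1, 0, 3);
	put_be32(buf + 4, app_ctx);
	put_be64(buf + 8, tag);
	return TMH_SIZE;
}

/* RVH: virtual address(8) | remote key(4) | length(4) */
static size_t encode_rvh(uint8_t *buf, uint64_t va, uint32_t rkey,
			 uint32_t len)
{
	put_be64(buf, va);
	put_be32(buf + 8, rkey);
	put_be32(buf + 12, len);
	return RVH_SIZE;
}

/* Rendezvous request: TMH | RVH | optional meta-data; returns total size */
static size_t encode_rndv_request(uint8_t *buf, size_t cap, uint32_t app_ctx,
				  uint64_t tag, uint64_t va, uint32_t rkey,
				  uint32_t len, const void *meta,
				  size_t meta_len)
{
	size_t off;

	assert(buf != NULL && TMH_SIZE + RVH_SIZE + meta_len <= cap);
	off = encode_tmh(buf, OP_RNDV, app_ctx, tag);
	off += encode_rvh(buf + off, va, rkey, len);
	if (meta_len)
		memcpy(buf + off, meta, meta_len);
	return off + meta_len;
}
```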

As tag-matching messages are standard RDMA-Sends, no special HW support is
needed at the sender. At the receiver, we introduce a new SRQ type - a
Tag-Matching SRQ (TM-SRQ). The TM-SRQ forms the serialization point for
matching messages coming from any of the associated RC connections, and reports
all tag matching completions and events to a dedicated CQ.
Two kinds of buffers may be posted to the TM-SRQ:
-   Buffers associated with tags (tagged-buffers), which are used when a match
    is made by HW
-   Standard SRQ buffers, which are used for unexpected messages (from HW's
    perspective)
When a message is matched by HW, the payload is transferred directly to the
application buffer (both in the eager and the rendezvous case), while skipping
any TM headers. Otherwise, the entire message, including any TM headers, is
scattered to the SRQ buffer.

Since unexpected messages are handled in SW, there exists an inherent race
between the arrival of messages from the wire and posting of new tagged
buffers. For example, consider 2 incoming messages m1 and m2 and matching
buffers b1 and b2 that are posted asynchronously. If b1 is posted after m1
arrives but before m2, m1 would be delivered as an unexpected message while m2
would match b1, violating the ordering rules. Consequently, whenever HW deems
a message unexpected, tag matching must be disabled for new tags until SW and
HW synchronize. This synchronization is achieved by reporting to HW the number
of unexpected messages handled by SW (with respect to the current posted
tags). When SW and HW are in sync, tag matching resumes normally.
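
The counter handshake can be modelled in a few lines. This is a deliberately
simplified sketch of the idea, not a description of the actual HW: it keeps
one counter per side, ignores counter wrap-around, and uses invented names
throughout. New tags become eligible for HW matching only when the count
reported by SW equals the number of messages HW has passed up as unexpected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tm_sync_model {
	uint32_t hw_unexpected;	/* messages HW delivered as unexpected */
	uint32_t sw_handled;	/* latest count reported by SW */
};

static bool in_sync(const struct tm_sync_model *m)
{
	return m->hw_unexpected == m->sw_handled;
}

/* HW side: a message missed the HW tag list and went to an SRQ buffer */
static void hw_unexpected_arrival(struct tm_sync_model *m)
{
	m->hw_unexpected++;
}

/* SW side: report the running count of unexpected messages handled so far
 * (carried by a tag add, delete or sync operation). Returns true when HW
 * matching of new tags may resume. */
static bool sw_report(struct tm_sync_model *m, uint32_t unexpected_cnt)
{
	assert(m != NULL);
	m->sw_handled = unexpected_cnt;
	return in_sync(m);
}
```

In the m1/m2 example above, posting b1 with a count of 0 after m1 has already
been delivered as unexpected leaves the two sides out of sync, so m2 cannot be
matched against b1 by HW; both messages reach SW in order.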

This RFC describes the corresponding Verbs API and implementation for RDMA tag
matching offload.

Tag Matching Verbs
==================

Capabilities
------------

Tag matching capabilities are queried by ibv_query_device(), and report the
following attributes:

* max_rndv_hdr_size - Max size of rendezvous request message
* max_num_tags - Max number of tagged buffers in a TM-SRQ matching list
* max_ops - Max number of outstanding tag matching operations
* max_sge - Max number of SGEs in a tagged buffer
* flags - the following flags are currently defined:
    - IBV_TM_CAP_RC - Support tag matching on RC transport


TM-SRQ creation
---------------

TM-SRQs are created by the ibv_create_srq_ex() Verb, which accepts the
following new attributes:
* srq_type - set to IBV_SRQT_TAG_MATCHING
* comp_mask - set the IBV_SRQ_INIT_ATTR_TM flag
* tm_cap - TM properties for this TM-SRQ; defined as follows:

struct ibv_tm_cap {
	 uint32_t max_num_tags;   /* Matching list size */
	 uint32_t max_ops;	  /* Number of outstanding TM operations */
};

Similarly to XRC SRQs, a TM-SRQ has a dedicated CQ.
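
An application would normally size tm_cap against the limits reported by the
device. The helper below is a small sketch of that step; it uses a local
stand-in for the two ibv_tm_cap fields (so that it builds without the patched
verbs.h), and the names tm_cap_model and clamp_tm_cap are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the two ibv_tm_cap fields shown above */
struct tm_cap_model {
	uint32_t max_num_tags;	/* matching list size */
	uint32_t max_ops;	/* outstanding TM operations */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Clamp the values the application would like to the device limits */
static struct tm_cap_model clamp_tm_cap(struct tm_cap_model wanted,
					struct tm_cap_model device)
{
	struct tm_cap_model out;

	assert(device.max_num_tags > 0 && device.max_ops > 0);
	out.max_num_tags = min_u32(wanted.max_num_tags, device.max_num_tags);
	out.max_ops = min_u32(wanted.max_ops, device.max_ops);
	return out;
}
```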

RC QPs are associated with the TM-SRQ just like standard SRQs. However, the
ownership of the QP's Send Queue is passed to the TM-SRQ, which uses it to
initiate rendezvous RDMA-Reads. Receive completions are reported to the
TM-SRQ's CQ.


Managing TM receive buffers
---------------------------

Untagged (unexpected) buffers are posted using the standard
ibv_post_srq_recv() Verb.

Tagged buffers are manipulated by a new ibv_post_srq_ops() Verb:

int ibv_post_srq_ops(struct ibv_srq *srq, struct ibv_ops_wr *wr,
		     struct ibv_ops_wr **bad_wr);

struct ibv_ops_wr {
	 uint64_t		 wr_id;    /* User defined WR ID */
	 /* Pointer to next WR in list, NULL if last WR */
	 struct ibv_ops_wr	*next;
	 enum ibv_ops_wr_opcode  opcode;   /* From enum ibv_ops_wr_opcode */
	 int			 flags;    /* From enum ibv_ops_flags */
	 struct {
		  /* Number of unexpected messages
		   * handled by SW */
		  uint32_t unexpected_cnt;
		  /* Input parameter for the DEL opcode
		   * and output parameter for the ADD opcode */
		  uint32_t handle;
		  struct {
			  /* WR ID for TM_RECV */
			  uint64_t		  recv_wr_id;
			  struct ibv_sge	 *sg_list;
			  int			  num_sge;
			  uint64_t		  tag;
			  uint64_t		  mask;
		  } add;
	 } tm;
};

The following opcodes are defined:

Opcode IBV_WR_TAG_ADD - add a tagged buffer entry to the tag matching list.
The input consists of an SGE list, a tag, a mask (matching parameters), and the
latest unexpected message count. A handle that uniquely identifies the entry is
returned in the 'handle' field.

Opcode IBV_WR_TAG_DEL - delete a tag entry.
The input is an entry handle returned from a previous IBV_WR_TAG_ADD
operation, and the latest unexpected message count.

Note that the operation may fail if the associated tag was already consumed by
an incoming message. In this case, an IBV_WC_TM_ERR status is returned in the
WC.

Opcode IBV_WR_TAG_SYNC - report the number of unexpected messages handled by
SW.
The input comprises only the unexpected message count. To reduce explicit
synchronization to a minimum, all completions indicate when synchronization is
necessary by setting the IBV_WC_TM_SYNC_REQ flag.

ibv_post_srq_ops() operations are non-signaled by default. To request an
explicit completion for a given operation, the standard IBV_OPS_SIGNALED flag
must be set. The number of outstanding tag-manipulation operations must not
exceed the 'max_ops' capability.

While 'wr_id' identifies the tag manipulation operation itself, the
'recv_wr_id' field is used to identify the tagged buffer in receive
completions.
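
Filling in the work requests might look roughly as follows. To keep the sketch
buildable on its own, it carries local copies of the structures from this
section; the numeric values of the opcode and flag enums are placeholders, and
prep_tag_add()/prep_tag_del() are hypothetical helpers, not part of the
proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies for illustration; enum values are placeholders */
enum ibv_ops_wr_opcode { IBV_WR_TAG_ADD, IBV_WR_TAG_DEL, IBV_WR_TAG_SYNC };
enum ibv_ops_flags { IBV_OPS_SIGNALED = 1 << 0 };

struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct ibv_ops_wr {
	uint64_t		 wr_id;
	struct ibv_ops_wr	*next;
	enum ibv_ops_wr_opcode	 opcode;
	int			 flags;
	struct {
		uint32_t unexpected_cnt;
		uint32_t handle;
		struct {
			uint64_t	 recv_wr_id;
			struct ibv_sge	*sg_list;
			int		 num_sge;
			uint64_t	 tag;
			uint64_t	 mask;
		} add;
	} tm;
};

/* Prepare a single-SGE TAG_ADD request */
static void prep_tag_add(struct ibv_ops_wr *wr, uint64_t wr_id,
			 uint64_t recv_wr_id, struct ibv_sge *sge,
			 uint64_t tag, uint64_t mask,
			 uint32_t unexpected_cnt, int signaled)
{
	assert(wr != NULL && sge != NULL);
	memset(wr, 0, sizeof(*wr));
	wr->wr_id = wr_id;			/* identifies the ADD itself */
	wr->opcode = IBV_WR_TAG_ADD;
	wr->flags = signaled ? IBV_OPS_SIGNALED : 0;
	wr->tm.unexpected_cnt = unexpected_cnt;
	wr->tm.add.recv_wr_id = recv_wr_id;	/* identifies the buffer */
	wr->tm.add.sg_list = sge;
	wr->tm.add.num_sge = 1;
	wr->tm.add.tag = tag;
	wr->tm.add.mask = mask;
}

/* Prepare a TAG_DEL request for a handle returned by an earlier ADD */
static void prep_tag_del(struct ibv_ops_wr *wr, uint64_t wr_id,
			 uint32_t handle, uint32_t unexpected_cnt)
{
	assert(wr != NULL);
	memset(wr, 0, sizeof(*wr));
	wr->wr_id = wr_id;
	wr->opcode = IBV_WR_TAG_DEL;
	wr->tm.unexpected_cnt = unexpected_cnt;
	wr->tm.handle = handle;
}
```

The prepared request would then be passed to ibv_post_srq_ops(), and for an
ADD the caller would save wr.tm.handle for a possible later DEL.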


Sending TM messages
-------------------

TM messages are sent using standard RC Send operations. A TM message comprises
a Tag-Matching Header (TMH), an optional Rendezvous Header (RVH), and
payload.

TMH and RVH are defined in infiniband/tm_types.h:

struct ibv_tmh {
	  uint8_t	  opcode;
	  uint8_t	  reserved[3];
	  __be32	  app_ctx;
	  __be64	  tag;
};

struct ibv_rvh {
	  __be64	  va;
	  __be32	  rkey;
	  __be32	  len;
};

The following opcodes are defined:

* IBV_TM_NO_TAG   - Send a message without a tag.
Such a message will always be treated as unexpected by the receiver TM-SRQ.
Any data following the opcode is ignored by the tag matching logic, and the
message is delivered in its entirety (including the opcode) to the standard
SRQ buffer.

* IBV_TM_OP_EAGER - Send an eager tagged message.
The message consists of a TMH followed by payload.

* IBV_TM_OP_RNDV  - Send a tagged rendezvous request.
The message consists of a TMH, an RVH, and optional additional data (which may
be inspected by receiver SW if the message is deemed unexpected). The RVH must
refer to a registered buffer containing the rendezvous payload. The total
rendezvous message size must not exceed the 'max_rndv_hdr_size' capability.
The Sender must consider the operation outstanding until a TM message with the
IBV_TM_OP_FIN opcode is received, after which the buffer may be deregistered
and freed.

* IBV_TM_OP_FIN   - Send a rendezvous completion indication.
The message consists of a copy of the original TMH and RVH of the rendezvous
request, apart from the opcode. This message is sent after the receiver has
completed the transfer of the rendezvous payload by an RDMA-read operation. It
may be sent either by HW or SW, depending on whether the rendezvous request
was handled as expected or unexpected by the TM-SRQ.
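
Since the FIN message echoes the TMH and RVH of the request, a sender can use
those fields to find the rendezvous it completes. The table below is one
possible way to track outstanding requests; the fixed-size table, the choice
of (tag, va, rkey) as the lookup key, and all names are assumptions made for
this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8	/* arbitrary table size for the sketch */

struct pending_rndv {
	uint64_t tag;
	uint64_t va;
	uint32_t rkey;
	bool	 in_use;
};

struct rndv_table {
	struct pending_rndv slot[MAX_PENDING];
};

/* Record a rendezvous request that was just sent; returns slot or -1 */
static int rndv_track(struct rndv_table *t, uint64_t tag, uint64_t va,
		      uint32_t rkey)
{
	assert(t != NULL);
	for (int i = 0; i < MAX_PENDING; i++) {
		if (!t->slot[i].in_use) {
			t->slot[i] = (struct pending_rndv){ tag, va, rkey,
							    true };
			return i;
		}
	}
	return -1;	/* table full */
}

/* FIN received: release the matching request. Only after this may the
 * source buffer be deregistered and freed. Returns slot or -1. */
static int rndv_on_fin(struct rndv_table *t, uint64_t tag, uint64_t va,
		       uint32_t rkey)
{
	assert(t != NULL);
	for (int i = 0; i < MAX_PENDING; i++) {
		struct pending_rndv *p = &t->slot[i];

		if (p->in_use && p->tag == tag && p->va == va &&
		    p->rkey == rkey) {
			p->in_use = false;
			return i;
		}
	}
	return -1;	/* no such outstanding request */
}
```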

TM completion processing
------------------------

There are 2 types of TM completions: tag-manipulation and receive completions.

Tag-manipulation operations generate the following completion opcodes:
* IBV_WC_TM_ADD - completion of a tag addition operation
* IBV_WC_TM_DEL - completion of a tag removal operation
* IBV_WC_TM_SYNC - completion of synchronization operation

These completions are complemented by the IBV_WC_TM_SYNC_REQ flag, which
indicates whether further HW synchronization is needed.

TM receive completions generate the following completion opcodes:
* IBV_WC_RECV - standard SRQ completion; used for unexpected messages
* IBV_WC_TM_NO_TAG - completion of a message sent with the IBV_TM_NO_TAG opcode
* IBV_WC_TM_RECV - completion of a tag-matching operation

The IBV_WC_TM_RECV completion is complemented by the following completion flags:
-	IBV_WC_TM_MATCH - a match was performed
-	IBV_WC_TM_DATA_VALID - all data of the matched message has been
	delivered to memory

In single-packet eager messages, both flags are set. When larger messages or
rendezvous transfers are involved, matching and data transfer completion are
distinct events that generate 2 completion events for the same 'recv_wr_id'.
While data transfer completions may be arbitrarily delayed depending on
message size, matching completion is reported immediately and is always
serialized with respect to other matches and the completion of unexpected
messages.
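
Because the two flags may arrive in one completion or in two, SW has to
accumulate them per 'recv_wr_id' before touching the data. A minimal sketch of
that bookkeeping follows; the flag bit values are placeholders and the state
structure and function name are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit values for the two completion flags */
enum {
	IBV_WC_TM_MATCH	     = 1 << 0,
	IBV_WC_TM_DATA_VALID = 1 << 1,
};

/* Per tagged buffer state, looked up by recv_wr_id */
struct tagged_buf_state {
	bool matched;		/* the tag was consumed by a message */
	bool data_valid;	/* the payload is fully in memory */
};

/* Fold one IBV_WC_TM_RECV completion into the state. Returns true once the
 * buffer has been both matched and completely written. */
static bool on_tm_recv(struct tagged_buf_state *s, unsigned int wc_flags)
{
	assert(s != NULL);
	if (wc_flags & IBV_WC_TM_MATCH)
		s->matched = true;
	if (wc_flags & IBV_WC_TM_DATA_VALID)
		s->data_valid = true;
	return s->matched && s->data_valid;
}
```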

In addition, IBV_WC_TM_RECV completions provide further information about the
matched message. This information is obtained using extended CQ processing via
the following extractor function:

static inline void ibv_wc_read_tm_info(struct ibv_cq_ex *cq,
				       struct ibv_wc_tm_info *tm_info);

struct ibv_wc_tm_info {
	  uint64_t		  tag;	   /* Tag information */
	  uint32_t		  priv;    /* Application context */
};

Finally, when a posted tagged buffer is insufficient to hold the data of a
rendezvous request, the HW completes the buffer with an
IBV_WC_TM_RNDV_INCOMPLETE status. In this case, the TMH and RVH headers are
scattered into the tagged buffer (tag-matching has still been completed!), and
message handling is resumed by SW.
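
When SW takes over in this case, it needs the RVH fields to issue the
RDMA-Read itself. The sketch below parses the two headers out of the tagged
buffer, assuming the 16-byte big-endian layouts of struct ibv_tmh and
struct ibv_rvh; the rndv_info structure and parse_rndv_headers() are
hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TMH_SIZE 16
#define RVH_SIZE 16

struct rndv_info {
	uint8_t	 opcode;
	uint32_t app_ctx;
	uint64_t tag;
	uint64_t va;	/* remote address to read from */
	uint32_t rkey;	/* remote key for the read */
	uint32_t len;	/* payload length at the sender */
};

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t get_be64(const uint8_t *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/* Decode TMH | RVH from buf; returns 0 on success, -1 if buf is too short */
static int parse_rndv_headers(const uint8_t *buf, size_t buf_len,
			      struct rndv_info *out)
{
	assert(buf != NULL && out != NULL);
	if (buf_len < TMH_SIZE + RVH_SIZE)
		return -1;
	out->opcode = buf[0];
	out->app_ctx = get_be32(buf + 4);
	out->tag = get_be64(buf + 8);
	out->va = get_be64(buf + TMH_SIZE);
	out->rkey = get_be32(buf + TMH_SIZE + 8);
	out->len = get_be32(buf + TMH_SIZE + 12);
	return 0;
}
```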

Artemy Kovalyov (5):
  verbs: Expose tag matching capabilities
  verbs: Introduce tag matching SRQ
  verbs: Tag matching list manipulation interface
  verbs: Tag matching send interface
  verbs: Tag matching receive interface

 libibverbs/tm_types.h |  70 ++++++++++++++++++++++++++++++++
 libibverbs/verbs.h    | 110 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 174 insertions(+), 6 deletions(-)
 create mode 100644 libibverbs/tm_types.h

-- 
1.8.3.1
