On Wed, May 11, 2022 at 5:59 PM Xie Yongji <xieyongji@xxxxxxxxxxxxx> wrote: > > Hi all, > Not very familiar with ROCE, try to give some comments from general virtio level. > This RFC aims to introduce our recent work on enabling RoCE support > for virtio-net device. We need to clarify the version of ROCE, is it ROCEv2 or not? > > To support RoCE, three types of virtqueues including RDMA send virtqueue, > RDMA receive virtqueue and RDMA completion virtqueue are introduced. > And control virtqueue is reused to support the RDMA control messages. > > Now we support some basic RDMA semantics such as send/receive > and read/write operation. It would be better to explain the advantages of this over the existing pvrdma approach. I guess one advantage is that using virtio makes it easier to connect to a userspace dataplane through vDPA/vhost-user? > > To test with our demo: > > 1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA > > 2. Build QEMU [2] with config VHOST_USER_RDMA > > 3. Build rdma-core [3] > > 4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3) > > 5. Build vhost-user-rdma [4] > > 6. Run vhost-user-rdma with command: > $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0' > > 7. Run qemu with command: > $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \ > -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ... It would be better to give some performance numbers (or even compare it with pvrdma). > > [1] https://github.com/bytedance/linux/tree/virtio-net-roce > [2] https://github.com/bytedance/qemu/tree/vhost-user-rdma > [3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma > [4] https://github.com/YongjiXie/vhost-user-rdma > > We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some > others in rdma-core. > > TODO: > And we'd better consider the live migration support. Having a quick glance, it looks to me trapping the cvq is sufficient? > 1. Add support for Base Memory Management Extensions > > 2. Add support for atomic operation > > 3. Add support for SRQ > > 4. Add support for virtqueue resize Note that this is already supported by the spec via virtqueue reset. > > 5. Add support for enabling/disabling virtqueue at runtime I guess virtqueue reset could help in this case. > > Please review, thanks! > > V1 to V2: > - Rework the implementation via extending virtio-net instead of > introducing a new device type [Jason] > - Add address handle support > > Signed-off-by: Xie Yongji <xieyongji@xxxxxxxxxxxxx> > Co-developed-by: Wei Junji <weijunji@xxxxxxxxxxxxx> > Signed-off-by: Wei Junji <weijunji@xxxxxxxxxxxxx> > --- > content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 854 insertions(+), 4 deletions(-) I wonder if there's some open-source ROCE transport device API that we can re-use then we can just behave like a transport layer instead of inventing new commands. > > diff --git a/content.tex b/content.tex > index 7508dd1..646d82a 100644 > --- a/content.tex > +++ b/content.tex > @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device} > placed in one virtqueue for receiving packets, and outgoing > packets are enqueued into another for transmission in that order. > A third command queue is used to control advanced filtering > -features. > +features. And if RoCE (RDMA over Converged Ethernet) capability > +is enabled, the virtio network device can also support transmitting > +and receiving RDMA message through RDMA send virtqueue, RDMA receive > +virtqueue and RDMA completion virtqueue. > > \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID} > > @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues} > \item[2(N-1)] receiveqN > \item[2(N-1)+1] transmitqN > \item[2N] controlq > +\item[2N+1] rdma_completeq1 > +\item[\ldots] > +\item[2N+M] rdma_completeqM > +\item[2N+M+1] rdma_transmitq1 > +\item[2N+M+2] rdma_receiveq1 > +\item[\ldots] > +\item[2N+M+2L-1] rdma_transmitqL > +\item[2N+M+2L] rdma_receiveqL > \end{description} > > N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by > - \field{max_virtqueue_pairs}. > + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by > + \field{max_rdma_qps}. > > controlq only exists if VIRTIO_NET_F_CTRL_VQ set. > > + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set > + > \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits} > > \begin{description} > @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits > \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control > channel. > > +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet) > + capability. > + > \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO > (fragmenting the packet) the USO splits large UDP packet > to several segments when each of these smaller packets has UDP header. > @@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device > \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ. > \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ. > \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ. > +\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ. > \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6. > \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ. > \end{description} > @@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device > u8 rss_max_key_size; > le16 rss_max_indirection_table_length; > le32 supported_hash_types; > + le32 max_rdma_qps; > + le32 max_rdma_cps; > }; > \end{lstlisting} > The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set. > @@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device > Field \field{supported_hash_types} contains the bitmask of supported hash types. > See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types. > > +Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set. > +It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage. > + > +Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set. > +It specifies the maximum number of completion virtqueue for RoCE usage. > + > \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout} > > The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive, > if it offers VIRTIO_NET_F_MQ. > > +The device MUST set \field{max_rdma_qps} to between 1 an 16384 inclusive, > +if it offers VIRTIO_NET_F_ROCE. I wonder why 16384 is chosen here? > + > +The device MUST set \field{max_rdma_cqs} to between 1 an 16384 inclusive, > +if it offers VIRTIO_NET_F_ROCE. > + > The device MUST set \field{mtu} to between 68 and 65535 inclusive, > if it offers VIRTIO_NET_F_MTU. > > @@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev > \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, > identify the control virtqueue. > > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated, > + identify the the RDMA completion virtqueues, up to max_rdma_cqs. > + > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated, > + identify the the RDMA send and receive virtqueues, up to max_rdma_qps. > + > \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}. > > \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and > @@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > u8 command; > u8 command-specific-data[]; > u8 ack; > + u8 ack-specific-data[]; > }; > > /* ack values */ > @@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > \end{lstlisting} > > The \field{class}, \field{command} and command-specific-data are set by the > -driver, and the device sets the \field{ack} byte. There is little it can > -do except issue a diagnostic if \field{ack} is not > +driver, and the device sets the \field{ack} byte and ack-specific-data. There > +is little it can do except issue a diagnostic if \field{ack} is not > VIRTIO_NET_OK. > > \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering} > @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi > according to the native endian of the guest rather than > (necessarily when not using the legacy interface) little-endian. > > +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration} > + > +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ), > +it can send control commands for RoCE usage. The following commands are defined now: > + > +\begin{lstlisting} > +#define VIRTIO_NET_CTRL_ROCE 6 > + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE 0 > + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT 1 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ 2 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ 3 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD 4 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD 5 > + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR 6 > + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR 7 > + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR 8 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP 9 > + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP 10 > + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP 11 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP 12 > + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH 13 > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH 14 > + #define VIRTIO_NET_CTRL_ROCE_ADD_GID 15 > + #define VIRTIO_NET_CTRL_ROCE_DEL_GID 16 > + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ 17 > +\end{lstlisting} > + > +\begin{description} > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device. > + No command-specific-data; > + the ack-specific-data is \field{struct virtio_rdma_ack_query_device}. > + > +\begin{lstlisting} > +struct virtio_rdma_ack_query_device { > +#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN (1 << 0) What's the meaning of this capability? > + /* Capabilities mask */ > + le64 device_cap_flags; Will this introduce a migration compatibility issue? E.g src and dst have the same features but different capabilities. > + /* Largest contiguous block that can be registered */ > + le64 max_mr_size; > + /* Supported memory shift sizes */ > + le64 page_size_cap; > + /* Hardware version */ > + le32 hw_ver; What did "hardware version" mean? Is this something that is defined in the IB spec? > + /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */ > + le32 max_qp_wr; Is this implied in the virtqueue size? If not, why? > + /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */ > + le32 max_send_sge; > + /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */ > + le32 max_recv_sge; > + /* Maximum number of s/g per WR for RDMA Read operations */ > + le32 max_sge_rd; > + /* Maximum size of Completion Queue (CQ) */ > + le32 max_cqe; Need to specify the reason why we can't use the virtqueue size for the completion queue. > + /* Maximum number of Memory Regions (MR) */ > + le32 max_mr; > + /* Maximum number of Protection Domains (PD) */ > + le32 max_pd; > + /* Maximum number of RDMA Read perations that can be outstanding per Queue Pair (QP) */ I guess you mean "operations" here. > + le32 max_qp_rd_atom; > + /* Maximum depth per QP for initiation of RDMA Read operations */ The member has an "atom" suffix, does it mean "atomic read" or other? > + le32 max_qp_init_rd_atom; > + /* Maximum number of Address Handles (AH) */ > + le32 max_ah; > + /* Local CA ack delay */ > + u8 local_ca_ack_delay; > + /* Padding */ > + u8 padding[3]; > + /* Reserved for future */ > + le32 reserved[14]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port. > + No command-specific-data; > + the ack-specific-data is \field{struct virtio_rdma_ack_query_port}. > + > +\begin{lstlisting} > +struct virtio_rdma_ack_query_port { > + /* Length of source Global Identifier (GID) table */ > + le32 gid_tbl_len; > + /* Maximum message size */ > + le32 max_msg_sz; I guess this is for both read/write/send/receive? And is 4GB sufficient for the future? > + /* Reserved for future */ > + le32 reserved[6]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ). > + The command-specific-data is \field{struct virtio_rdma_cmd_create_cq}; > + the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_create_cq { > + /* Size of CQ */ > + le32 cqe; > +}; > + > +struct virtio_rdma_ack_create_cq { > + /* The index of CQ */ > + le32 cqn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue. > + The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_cq { > + /* The index of CQ */ > + le32 cqn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD). > + No command-specific-data; > + the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}. > + > +\begin{lstlisting} > +struct virtio_rdma_ack_create_pd { > + /* The handle of PD */ > + le32 pdn; > +}; > +\end{lstlisting} Can this command always succeed? I meant is there a limit of the total number of PDs that a single ROCE device can support? > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTORY_PD] Destroy a Protection Domain. > + The command-specific-data is \field{virtio_rdma_cmd_destroy_pd}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_pd { > + /* The handle of PD */ > + le32 pdn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR). > + associated with one protection domain. I wonder what's the difference between VIRTIO_NET_CTRL_ROCE_GET_DMA_MR and USR_MR. Can we unify them? > + The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr}; > + the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}. > + > +\begin{lstlisting} > +enum virtio_ib_access_flags { > + VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0), Is LOCAL_READ implied to work always? > + VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1), > + VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2), > +}; > + > +struct virtio_rdma_cmd_get_dma_mr { > + /* The handle of PD which the MR associated with */ > + le32 pdn; > + /* MR's protection attributes, enum virtio_ib_access_flags */ > + le32 access_flags; > +}; > + > +struct virtio_rdma_ack_get_dma_mr { > + /* The handle of MR */ > + le32 mrn; > + /* MR's local access key */ > + le32 lkey; > + /* MR's remote access key */ > + le32 rkey; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region > + associated with one Protection Domain. > + The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr}; > + the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_reg_user_mr { > + /* The handle of PD which the MR associated with */ > + le32 pdn; > + /* MR's protection attributes, enum virtio_ib_access_flags */ > + le32 access_flags; > + /* Starting virtual address of MR */ > + le64 virt_addr; I guess this is actually the I/O virtual address and the device is in charge of translate it to the page arrays below? > + /* Length of MR */ > + le64 length; > + /* Size of the below page array */ > + le32 npages; > + /* Padding */ > + le32 padding; > + /* Array to store physical address of each page in MR */ > + le64 pages[]; How do device know the size of a page? > +}; I believe this command can fail, we need to describe the error conditions. > + > +struct virtio_rdma_ack_reg_user_mr { > + /* The handle of MR */ > + le32 mrn; > + /* MR's local access key */ > + le32 lkey; > + /* MR's remote access key */ > + le32 rkey; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region. > + The command-specific-data is \field{virtio_rdma_cmd_dereg_mr}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_dereg_mr { > + /* The handle of MR */ > + le32 mrn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue). > + The command-specific-data is \field{virtio_rdma_cmd_create_qp}; > + the ack-specific-data is \field{virtio_rdma_ack_create_qp}. > + > +\begin{lstlisting} > +struct virtio_rdma_qp_cap { > + /* Maximum number of outstanding WRs in SQ */ > + le32 max_send_wr; > + /* Maximum number of outstanding WRs in RQ */ > + le32 max_recv_wr; > + /* Maximum number of s/g elements per WR in SQ */ > + le32 max_send_sge; > + /* Maximum number of s/g elements per WR in RQ */ > + le32 max_recv_sge; > + /* Maximum number of data (bytes) that can be posted inline to SQ */ > + le32 max_inline_data; > + /* Padding */ > + le32 padding; > +}; > + > +struct virtio_rdma_cmd_create_qp { > + /* The handle of PD which the QP associated with */ > + le32 pdn; > +#define VIRTIO_IB_QPT_SMI 0 > +#define VIRTIO_IB_QPT_GSI 1 > +#define VIRTIO_IB_QPT_RC 2 > +#define VIRTIO_IB_QPT_UC 3 > +#define VIRTIO_IB_QPT_UD 4 > + /* QP's type */ > + u8 qp_type; > + /* If set, each WR submitted to the SQ generates a completion entry */ > + u8 sq_sig_all; > + /* Padding */ > + u8 padding[2]; > + /* The index of CQ which the SQ associated with */ > + le32 send_cqn; > + /* The index of CQ which the RQ associated with */ > + le32 recv_cqn; > + /* QP's capabilities */ > + struct virtio_rdma_qp_cap cap; > + /* Reserved for future */ > + le32 reserved[4]; > +}; > + > +struct virtio_rdma_ack_create_qp { > + /* The index of QP */ > + le32 qpn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair. > + The command-specific-data is \field{virtio_rdma_cmd_modify_qp}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_global_route { > + /* Destination GID or MGID */ > + u8 dgid[16]; > + /* Flow label */ > + le32 flow_label; > + /* Source GID index */ > + u8 sgid_index; > + /* Hop limit */ > + u8 hop_limit; > + /* Traffic class */ > + u8 traffic_class; > + /* Padding */ > + u8 padding; > +}; > + > +struct virtio_rdma_ah_attr { > + /* Global Routing Header (GRH) attributes */ > + virtio_rdma_global_route grh; > + /* Destination MAC address */ > + u8 dmac[6]; > + /* Reserved for future */ > + u8 reserved[10]; > +}; > + > +enum virtio_ib_qp_attr_mask { > + VIRTIO_IB_QP_STATE = (1 << 0), > + VIRTIO_IB_QP_CUR_STATE = (1 << 1), > + VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2), > + VIRTIO_IB_QP_QKEY = (1 << 3), > + VIRTIO_IB_QP_AV = (1 << 4), > + VIRTIO_IB_QP_PATH_MTU = (1 << 5), > + VIRTIO_IB_QP_TIMEOUT = (1 << 6), > + VIRTIO_IB_QP_RETRY_CNT = (1 << 7), > + VIRTIO_IB_QP_RNR_RETRY = (1 << 8), > + VIRTIO_IB_QP_RQ_PSN = (1 << 9), > + VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10), > + VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11), > + VIRTIO_IB_QP_SQ_PSN = (1 << 12), > + VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13), > + VIRTIO_IB_QP_CAP = (1 << 14), > + VIRTIO_IB_QP_DEST_QPN = (1 << 15), > + VIRTIO_IB_QP_RATE_LIMIT = (1 << 16), > +}; Do we need to explain the above error codes? Or it's simply a map from IB spec? > + > +enum virtio_ib_qp_state { > + VIRTIO_IB_QPS_RESET, > + VIRTIO_IB_QPS_INIT, > + VIRTIO_IB_QPS_RTR, > + VIRTIO_IB_QPS_RTS, > + VIRTIO_IB_QPS_SQD, > + VIRTIO_IB_QPS_SQE, > + VIRTIO_IB_QPS_ERR > +}; > + > +enum virtio_ib_mtu { > + VIRTIO_IB_MTU_256 = 1, > + VIRTIO_IB_MTU_512 = 2, > + VIRTIO_IB_MTU_1024 = 3, > + VIRTIO_IB_MTU_2048 = 4, > + VIRTIO_IB_MTU_4096 = 5 > +}; > + > +struct virtio_rdma_cmd_modify_qp { > + /* The index of QP */ > + le32 qpn; > + /* The mask of attributes needs to be modified, enum virtio_ib_qp_attr_mask */ > + le32 attr_mask; > + /* Move the QP to this state, enum virtio_ib_qp_state */ > + u8 qp_state; > + /* Current QP state, enum virtio_ib_qp_state */ > + u8 cur_qp_state; > + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ > + u8 path_mtu; > + /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */ > + u8 max_rd_atomic; > + /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */ > + u8 max_dest_rd_atomic; > + /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */ > + u8 min_rnr_timer; > + /* Local ack timeout (valid only for RC QPs) */ > + u8 timeout; > + /* Retry count (valid only for RC QPs) */ > + u8 retry_cnt; > + /* RNR retry (valid only for RC QPs) */ > + u8 rnr_retry; > + /* Padding */ > + u8 padding[7]; > + /* Q_Key for the QP (valid only for UD QPs) */ > + le32 qkey; > + /* PSN for RQ (valid only for RC/UC QPs) */ > + le32 rq_psn; > + /* PSN for SQ */ > + le32 sq_psn; > + /* Destination QP number (valid only for RC/UC QPs) */ > + le32 dest_qp_num; > + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ > + le32 qp_access_flags; > + /* Rate limit in kbps for packet pacing */ > + le32 rate_limit; > + /* QP capabilities */ > + struct virtio_rdma_qp_cap cap; > + /* Address Vector (valid only for RC/UC QPs) */ > + struct virtio_rdma_ah_attr ah_attr; > + /* Reserved for future */ > + le32 reserved[4]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair. > + The command-specific-data is \field{virtio_rdma_cmd_query_qp}; > + the ack-specific-data is \field{virtio_rdma_ack_query_qp}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_query_qp { > + /* The index of QP */ > + le32 qpn; > + /* The mask of attributes need to be queried, enum virtio_ib_qp_attr_mask */ > + le32 attr_mask; > +}; > + > +struct virtio_rdma_ack_query_qp { Any chance to unify this with virtio_rdma_cmd_modify_qp? > + /* Move the QP to this state, enum virtio_ib_qp_state */ > + u8 qp_state; > + /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */ > + u8 path_mtu; > + /* Is the SQ draining */ > + u8 sq_draining; > + /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */ > + u8 max_rd_atomic; > + /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */ > + u8 max_dest_rd_atomic; > + /* Minimum RNR NAK timer (valid only for RC QPs) */ > + u8 min_rnr_timer; > + /* Local ack timeout (valid only for RC QPs) */ > + u8 timeout; > + /* Retry count (valid only for RC QPs) */ > + u8 retry_cnt; > + /* RNR retry (valid only for RC QPs) */ > + u8 rnr_retry; > + /* Padding */ > + u8 padding[7]; > + /* Q_Key for the QP (valid only for UD QPs) */ > + le32 qkey; > + /* PSN for RQ (valid only for RC/UC QPs) */ > + le32 rq_psn; > + /* PSN for SQ */ > + le32 sq_psn; > + /* Destination QP number (valid only for RC/UC QPs) */ > + le32 dest_qp_num; > + /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */ > + le32 qp_access_flags; > + /* Rate limit in kbps for packet pacing */ > + le32 rate_limit; > + /* QP capabilities */ > + struct virtio_rdma_qp_cap cap; > + /* Address Vector (valid only for RC/UC QPs) */ > + struct virtio_rdma_ah_attr ah_attr; > + /* Reserved for future */ > + le32 reserved[4]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair. > + The command-specific-data is \field{virtio_rdma_cmd_destroy_qp}; > + no ack-specific-data. What happen to the pending requests? Will the device wait for the completion or not? > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_qp { > + /* The index of QP */ > + le32 qpn; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create a Address Handle (AH). > + The command-specific-data is \field{virtio_rdma_cmd_create_ah}; > + the ack-specific-data is \field{virtio_rdma_ack_create_ah}. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_create_ah { > + /* The handle of PD which the AH associated with */ > + le32 pdn; > + /* Padding */ > + le32 padding; > + /* Address Vector */ > + struct virtio_rdma_ah_attr ah_attr; > +}; > + > +struct virtio_rdma_ack_create_ah { > + /* The address handle */ > + le32 ah; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy a Address Handle. > + The command-specific-data is \field{virtio_rdma_cmd_destroy_ah}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_destroy_ah { > + /* The handle of PD which the AH associated with */ > + le32 pdn; > + /* The address handle */ > + le32 ah; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID). > + The command-specific-data is \field{virtio_rdma_cmd_add_gid}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_add_gid { > + /* Index of GID */ > + le16 index; > + /* Padding */ > + le16 padding[3]; > + /* GID to be added */ > + u8 gid[16]; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier. > + The command-specific-data is \field{virtio_rdma_cmd_del_gid}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_del_gid { > + /* Index of GID */ > + le16 index; > +}; > +\end{lstlisting} > + > +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification > + on a Completion Queue. > + The command-specific-data is \field{virtio_rdma_cmd_req_notify}; > + no ack-specific-data. > + > +\begin{lstlisting} > +struct virtio_rdma_cmd_req_notify { > + /* The index of CQ */ > + le32 cqn; > +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0) > +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1) Need to describe the differences on those two flags. > + /* Notify flags */ > + le32 flags; > +}; > +\end{lstlisting} > + > +\end{description} > + > +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration} > + > +A driver MUST initialize the completion virtqueue and fill it with > +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is > +successfully executed. > + > +A driver MUST reset the completion virtqueue after How to do the reset? Do you mean driver need to reset the indices? > +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed. > + > +A driver MUST initialize the send virtqueue and receive virtqueue after > +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed. > + > +A driver MUST reset the send virtqueue and receive virtqueue after > +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed. > > \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device > Types / Network Device / Legacy Interface: Framing Requirements} > @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device > See \ref{sec:Basic > Facilities of a Virtio Device / Virtqueues / Message Framing}. > > +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support} > + > +RDMA over Converged Ethernet (RoCE) is a network protocol that allows > +remote direct memory access (RDMA) over an Ethernet network. To support > +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addtion to the control > +virtqueue support mentioned in \ref{sec:Device Types / Network Device / > +Device Operation / Control Virtqueue / RoCE Configuration}, multiple > +types of virtqueues including send virtqueue, receive virtqueue and > +completion virtqueue are introduced. > + > +The send virtqueue contains elements that describe the data to be > +transmitted. > + > +Requests (device-readable) have the following format: > + > +\begin{lstlisting} > +enum virtio_ib_wr_opcode { > + VIRTIO_IB_WR_RDMA_WRITE, > + VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, > + VIRTIO_IB_WR_SEND, > + VIRTIO_IB_WR_SEND_WITH_IMM, > + VIRTIO_IB_WR_RDMA_READ, > +}; > + > +struct virtio_rdma_sge { > + le64 addr; > + le32 length; > + le32 lkey; > +}; > + > +struct virtio_rdma_sq_req { > + /* User defined WR ID */ > + le64 wr_id; > + /* WR opcode, enum virtio_ib_wr_opcode */ > + u8 opcode; > +#define VIRTIO_IB_SEND_FENCE (1 << 0) > +#define VIRTIO_IB_SEND_SIGNALED (1 << 1) > +#define VIRTIO_IB_SEND_SOLICITED (1 << 2) > +#define VIRTIO_IB_SEND_INLINE (1 << 3) > + /* Flags of the WR properties */ > + u8 send_flags; > + /* Padding */ > + le16 padding; > + /* Immediate data (in network byte order) to send */ > + le32 imm_data; > + union { > + struct { > + /* Start address of remote memory buffer */ > + le64 remote_addr; > + /* Key of the remote MR */ > + le32 rkey; > + } rdma; > + struct { > + /* Index of the destination QP */ > + le32 remote_qpn; > + /* Q_Key of the destination QP */ > + le32 remote_qkey; > + /* Address Handle */ > + le32 ah; > + } ud; > + /* Reserved for future */ > + le64 reserved[4]; > + }; > + /* Inline data */ > + u8 inline_data[512]; > + union { > + /* Length of sg_list */ > + le32 num_sge; > + /* Length of inline data */ > + le16 inline_len; > + }; > + /* Reserved for future */ > + le32 reserved2[3]; > + /* Scatter/gather list */ > + struct virtio_rdma_sge sg_list[]; > +}; > +\end{lstlisting} > + > +The receive virtqueue contains elements that describe where to place incoming data. > + > +Requests (device-readable) have the following format: > + > +\begin{lstlisting} > +struct virtio_rdma_rq_req { > + /* User defined WR ID */ > + le64 wr_id; > + /* Length of sg_list */ > + le32 num_sge; > + /* Reserved for future */ > + le32 reserved[3]; > + /* Scatter/gather list */ > + struct virtio_rdma_sge sg_list[]; > +}; > +\end{lstlisting} > + > +The completion virtqueue is used to notify the completion of requests in > +send virtqueue or receive virtqueue. > + > +Requests (device-writable) have the following format: > + > +\begin{lstlisting} > +enum virtio_ib_wc_opcode { > + VIRTIO_IB_WC_SEND, > + VIRTIO_IB_WC_RDMA_WRITE, > + VIRTIO_IB_WC_RDMA_READ, > + VIRTIO_IB_WC_RECV, > + VIRTIO_IB_WC_RECV_RDMA_WITH_IMM, > +}; > + > +enum virtio_ib_wc_status { > + /* Operation completed successfully */ > + VIRTIO_IB_WC_SUCCESS, > + /* Local Length Error */ > + VIRTIO_IB_WC_LOC_LEN_ERR, > + /* Local QP Operation Error */ > + VIRTIO_IB_WC_LOC_QP_OP_ERR, > + /* Local Protection Error */ > + VIRTIO_IB_WC_LOC_PROT_ERR, > + /* Work Request Flushed Error */ > + VIRTIO_IB_WC_WR_FLUSH_ERR, > + /* Bad Response Error */ > + VIRTIO_IB_WC_BAD_RESP_ERR, > + /* Local Access Error */ > + VIRTIO_IB_WC_LOC_ACCESS_ERR, > + /* Remote Invalid Request Error */ > + VIRTIO_IB_WC_REM_INV_REQ_ERR, > + /* Remote Access Error */ > + VIRTIO_IB_WC_REM_ACCESS_ERR, > + /* Remote Operation Error */ > + VIRTIO_IB_WC_REM_OP_ERR, > + /* Transport Retry Counter Exceeded */ > + VIRTIO_IB_WC_RETRY_EXC_ERR, > + /* RNR Retry Counter Exceeded */ > + VIRTIO_IB_WC_RNR_RETRY_EXC_ERR, > + /* Remote Aborted Error */ > + VIRTIO_IB_WC_REM_ABORT_ERR, > + /* Fatal Error */ > + VIRTIO_IB_WC_FATAL_ERR, > + /* Response Timeout Error */ > + VIRTIO_IB_WC_RESP_TIMEOUT_ERR, > + /* General Error */ > + VIRTIO_IB_WC_GENERAL_ERR > +}; > + > +struct virtio_rdma_cq_req { > + /* User defined WR ID */ > + le64 wr_id; > + /* Work completion status, enum virtio_ib_wc_status */ > + u8 status; > + /* WR opcode, enum virtio_ib_wc_opcode */ > + u8 opcode; > + /* Padding */ > + le16 padding; > + /* Vendor error */ > + le32 vendor_err; > + /* Number of bytes transferred */ > + le32 byte_len; > + /* Immediate data (in network byte order) to send */ > + le32 imm_data; > + /* Local QP number of completed WR */ > + le32 qp_num; > + /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */ > + le32 src_qp; > +#define VIRTIO_IB_WC_GRH (1 << 0) > +#define VIRTIO_IB_WC_WITH_IMM (1 << 1) > + /* Work completion flag */ > + le32 wc_flags; > + /* Reserved for future */ > + le32 reserved[3]; > +}; > +\end{lstlisting} > + > +\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation} > + > +The send operation allows us to send data to a remote QP’s Receive Queue. > +The receiver MUST have previously posted a receive buffer to receive the data. "MUST" keyword must belong to the normative section. > + > +To do a send operation, a request with \field{opcode} set to > +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send > +Queue as one output descriptor and the device is notified of the new entry. > + > +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation} > + > +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST fill > +send buffer into \field{inline_data} field and set \field{inline_len} to the > +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to > +describe the buffer. > + > +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation} > + > +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST > +ignore \field{imm_data}. > + > +If the QP type is UD, the device MUST validate \field{ud.ah}. > + > +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST > +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}. > + > +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} > + > +The receive operation allows us to receive data from remote QP. > +It's the corresponding operation to a send operation. > + > +To do a receive operation, a request MUST be posted to the Receive > +Queue as one output descriptor and the device is notified of the new entry. > + I think we probably need to be more verbose as what has been done for virtio-net. That is, describe what need to be filled in virtio_rdma_rq_req in details. (And do this for other operation as well) > +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} > + > +The driver MUST fill \field{sg_list} to describe the receive buffer. > + > +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation} > + > +The device MUST validate the \field{addr}, \field{length} and \field{lkey} > +in \field{sg_list}. > + > +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation} > + > +The write operation allows us to write data to the local memory buffer > +in remote side with no notification. The remote side wouldn't be aware > +that this operation being done. > + > +To do a write operation, a request with \field{opcode} set to > +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be > +posted to the Send Queue as one output descriptor and the device is > +notified of the new entry. > + > +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation} > + > +The driver MUST fill \field{sg_list} to describe the write buffer. So sg is a must even if the driver want to use imm? > + > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to > +identify the remote buffer. > + > +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation} > + > +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device > +MUST ignore \field{imm_data}. > + > +The device MUST validate the \field{addr}, \field{length} and \field{lkey} > +in \field{sg_list}. > + > +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation} > + > +The read operation allows us to read data from the local memory buffer > +in remote side with no notification. The remote side wouldn't be aware > +that this operation being done. > + > +To do a read operation, a request with \field{opcode} set to > +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output > +descriptor and the device is notified of the new entry. > + > +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation} > + > +The driver MUST fill \field{sg_list} to describe the read buffer. > + > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to > +identify the remote buffer. > + > +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation} > + > +The device MUST validate the \field{addr}, \field{length} and \field{lkey} > +in \field{sg_list}. > + > +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification} > + > +After above operation is completed, a completion notification MUST > +be triggered by the device. For "completion notification", do you mean the virtqueue notification of cq or the making the buffer than contains cqe used? > To achieve that, the device MUST consume > +an entry of the Completion Queue associated with the Send Queue/Receive > +Queue which the operation belongs to. > + > +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification} > + > +The driver MUST fill the Completion Queue with enough entries previously. What do you mean by "previously"? What happens if there's no sufficient cqe? Thanks > + > +\devicenormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification} > + > +If \field{imm_data} is valid, the device MUST set VIRTIO_IB_WC_WITH_IMM to > +\field{wc_flags}. > + > +The device MUST set \field{wr_id} to the value of \field{wr_id} of > +corresponding \field{struct virtio_rdma_sq_req} or > +\field{struct virtio_rdma_rq_req}. > + > \section{Block Device}\label{sec:Device Types / Block Device} > > The virtio block device is a simple virtual block device (ie. > -- > 2.11.0 >