Re: [RFC v2] virtio-net: Add RoCE (RDMA over Converged Ethernet) support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 4, 2022 at 4:30 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
>
> On Wed, May 11, 2022 at 5:59 PM Xie Yongji <xieyongji@xxxxxxxxxxxxx> wrote:
> >
> > Hi all,
> >
>
> Not very familiar with ROCE, try to give some comments from general
> virtio level.
>

Thank you!

> > This RFC aims to introduce our recent work on enabling RoCE support
> > for virtio-net device.
>
> We need to clarify the version of ROCE, is it ROCEv2 or not?
>

Yes, it's RoCE v2.

> >
> > To support RoCE, three types of virtqueues including RDMA send virtqueue,
> > RDMA receive virtqueue and RDMA completion virtqueue are introduced.
> > And control virtqueue is reused to support the RDMA control messages.
> >
> > Now we support some basic RDMA semantics such as send/receive
> > and read/write operation.
>
> It would be better to explain the advantages of this over the existing
> pvrdma approach. I guess one advantage is that using virtio makes it
> easier to connect to a userspace dataplane through vDPA/vhost-user?
>

Yes, this is one advantage. Another one is that we don't need a
physical RDMA-capable NIC.

> >
> > To test with our demo:
> >
> > 1. Build Guest kernel [1] with config INFINIBAND_VIRTIO_RDMA
> >
> > 2. Build QEMU [2] with config VHOST_USER_RDMA
> >
> > 3. Build rdma-core [3]
> >
> > 4. Build and install DPDK (NOTE that we only tested on DPDK 20.11.3)
> >
> > 5. Build vhost-user-rdma [4]
> >
> > 6. Run vhost-user-rdma with command:
> >     $ ./vhost-user-rdma --vdev 'net_tap0' --lcore '1-3' -- -s '/tmp/vhost-rdma0'
> >
> > 7. Run qemu with command:
> >     $ qemu-system-x86_64 -chardev socket,path=/tmp/vhost-rdma0,id=vrdma \
> >       -device vhost-user-rdma-pci,page-per-vq,chardev=vrdma ...
>
> It would be better to give some performance numbers (or even compare
> it with pvrdma).
>

OK, will do it in v3.

> >
> > [1] https://github.com/bytedance/linux/tree/virtio-net-roce
> > [2] https://github.com/bytedance/qemu/tree/vhost-user-rdma
> > [3] https://github.com/YongjiXie/rdma-core/tree/virtio-rdma
> > [4] https://github.com/YongjiXie/vhost-user-rdma
> >
> > We have already tested it with ibv_rc_pingpong, ibv_ud_pingpong and some
> > others in rdma-core.
> >
> > TODO:
> >
>
> And we'd better consider the live migration support. Having a quick
> glance, it looks to me trapping the cvq is sufficient?
>

I'm not sure. Each QP has its own state machine, which may also
require save & restore.

> > 1. Add support for Base Memory Management Extensions
> >
> > 2. Add support for atomic operation
> >
> > 3. Add support for SRQ
> >
> > 4. Add support for virtqueue resize
>
> Note that this is already supported by the spec via virtqueue reset.
>

OK.

> >
> > 5. Add support for enabling/disabling virtqueue at runtime
>
> I guess virtqueue reset could help in this case.
>

We might need to do some extension since we want to free the resources
when disabling the queue.

> >
> > Please review, thanks!
> >
> > V1 to V2:
> > - Rework the implementation via extending virtio-net instead of
> >   introducing a new device type [Jason]
> > - Add address handle support
> >
> > Signed-off-by: Xie Yongji <xieyongji@xxxxxxxxxxxxx>
> > Co-developed-by: Wei Junji <weijunji@xxxxxxxxxxxxx>
> > Signed-off-by: Wei Junji <weijunji@xxxxxxxxxxxxx>
> > ---
> >  content.tex | 858 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 854 insertions(+), 4 deletions(-)
>
> I wonder if there's some open-source ROCE transport device API that we
> can re-use then we can just behave like a transport layer instead of
> inventing new commands.
>

That's would be better. But I didn't find one.

> >
> > diff --git a/content.tex b/content.tex
> > index 7508dd1..646d82a 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -3008,7 +3008,10 @@ \section{Network Device}\label{sec:Device Types / Network Device}
> >  placed in one virtqueue for receiving packets, and outgoing
> >  packets are enqueued into another for transmission in that order.
> >  A third command queue is used to control advanced filtering
> > -features.
> > +features. And if RoCE (RDMA over Converged Ethernet) capability
> > +is enabled, the virtio network device can also support transmitting
> > +and receiving RDMA message through RDMA send virtqueue, RDMA receive
> > +virtqueue and RDMA completion virtqueue.
> >
> >  \subsection{Device ID}\label{sec:Device Types / Network Device / Device ID}
> >
> > @@ -3023,13 +3026,24 @@ \subsection{Virtqueues}\label{sec:Device Types / Network Device / Virtqueues}
> >  \item[2(N-1)] receiveqN
> >  \item[2(N-1)+1] transmitqN
> >  \item[2N] controlq
> > +\item[2N+1] rdma_completeq1
> > +\item[\ldots]
> > +\item[2N+M] rdma_completeqM
> > +\item[2N+M+1] rdma_transmitq1
> > +\item[2N+M+2] rdma_receiveq1
> > +\item[\ldots]
> > +\item[2N+M+2L-1] rdma_transmitqL
> > +\item[2N+M+2L] rdma_receiveqL
> >  \end{description}
> >
> >   N=1 if neither VIRTIO_NET_F_MQ nor VIRTIO_NET_F_RSS are negotiated, otherwise N is set by
> > - \field{max_virtqueue_pairs}.
> > + \field{max_virtqueue_pairs}. M is set by \field{max_rdma_cqs} and L is set by
> > + \field{max_rdma_qps}.
> >
> >   controlq only exists if VIRTIO_NET_F_CTRL_VQ set.
> >
> > + rdma_completeq, rdma_transmitq and rdma_receiveq only exist if VIRTIO_NET_F_ROCE set
> > +
> >  \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits}
> >
> >  \begin{description}
> > @@ -3084,6 +3098,9 @@ \subsection{Feature bits}\label{sec:Device Types / Network Device / Feature bits
> >  \item[VIRTIO_NET_F_CTRL_MAC_ADDR(23)] Set MAC address through control
> >      channel.
> >
> > +\item[VIRTIO_NET_F_ROCE(55)] Device supports RoCE (RDMA over Converged Ethernet)
> > +    capability.
> > +
> >  \item[VIRTIO_NET_F_HOST_USO (56)] Device can receive USO packets. Unlike UFO
> >   (fragmenting the packet) the USO splits large UDP packet
> >   to several segments when each of these smaller packets has UDP header.
> > @@ -3129,6 +3146,7 @@ \subsubsection{Feature bit requirements}\label{sec:Device Types / Network Device
> >  \item[VIRTIO_NET_F_GUEST_ANNOUNCE] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \item[VIRTIO_NET_F_MQ] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \item[VIRTIO_NET_F_CTRL_MAC_ADDR] Requires VIRTIO_NET_F_CTRL_VQ.
> > +\item[VIRTIO_NET_F_ROCE] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \item[VIRTIO_NET_F_RSC_EXT] Requires VIRTIO_NET_F_HOST_TSO4 or VIRTIO_NET_F_HOST_TSO6.
> >  \item[VIRTIO_NET_F_RSS] Requires VIRTIO_NET_F_CTRL_VQ.
> >  \end{description}
> > @@ -3190,6 +3208,8 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
> >          u8 rss_max_key_size;
> >          le16 rss_max_indirection_table_length;
> >          le32 supported_hash_types;
> > +        le32 max_rdma_qps;
> > +        le32 max_rdma_cps;
> >  };
> >  \end{lstlisting}
> >  The following field, \field{rss_max_key_size} only exists if VIRTIO_NET_F_RSS or VIRTIO_NET_F_HASH_REPORT is set.
> > @@ -3204,11 +3224,23 @@ \subsection{Device configuration layout}\label{sec:Device Types / Network Device
> >  Field \field{supported_hash_types} contains the bitmask of supported hash types.
> >  See \ref{sec:Device Types / Network Device / Device Operation / Processing of Incoming Packets / Hash calculation for incoming packets / Supported/enabled hash types} for details of supported hash types.
> >
> > +Field \field{max_rdma_qps} only exists if VIRTIO_NET_F_ROCE is set.
> > +It specifies the maximum number of queue pairs (send virtqueue and receive virtqueue) for RoCE usage.
> > +
> > +Field \field{max_rdma_cqs} only exists if VIRTIO_NET_F_ROCE is set.
> > +It specifies the maximum number of completion virtqueue for RoCE usage.
> > +
> >  \devicenormative{\subsubsection}{Device configuration layout}{Device Types / Network Device / Device configuration layout}
> >
> >  The device MUST set \field{max_virtqueue_pairs} to between 1 and 0x8000 inclusive,
> >  if it offers VIRTIO_NET_F_MQ.
> >
> > +The device MUST set \field{max_rdma_qps} to between 1 an 16384 inclusive,
> > +if it offers VIRTIO_NET_F_ROCE.
>
> I wonder why 16384 is chosen here?
>

Since the max queue number is limited to 65536 and we have three types
of queue, the queue number should be less than 65536 / 3. We choose
65536 / 4 here.

> > +
> > +The device MUST set \field{max_rdma_cqs} to between 1 an 16384 inclusive,
> > +if it offers VIRTIO_NET_F_ROCE.
> > +
> >  The device MUST set \field{mtu} to between 68 and 65535 inclusive,
> >  if it offers VIRTIO_NET_F_MTU.
> >
> > @@ -3306,6 +3338,12 @@ \subsection{Device Initialization}\label{sec:Device Types / Network Device / Dev
> >  \item If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated,
> >    identify the control virtqueue.
> >
> > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> > +  identify the the RDMA completion virtqueues, up to max_rdma_cqs.
> > +
> > +\item If the VIRTIO_NET_F_ROCE feature bit is negotiated,
> > +  identify the the RDMA send and receive virtqueues, up to max_rdma_qps.
> > +
> >  \item Fill the receive queues with buffers: see \ref{sec:Device Types / Network Device / Device Operation / Setting Up Receive Buffers}.
> >
> >  \item Even with VIRTIO_NET_F_MQ, only receiveq1, transmitq1 and
> > @@ -4007,6 +4045,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
> >          u8 command;
> >          u8 command-specific-data[];
> >          u8 ack;
> > +        u8 ack-specific-data[];
> >  };
> >
> >  /* ack values */
> > @@ -4015,8 +4054,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
> >  \end{lstlisting}
> >
> >  The \field{class}, \field{command} and command-specific-data are set by the
> > -driver, and the device sets the \field{ack} byte. There is little it can
> > -do except issue a diagnostic if \field{ack} is not
> > +driver, and the device sets the \field{ack} byte and ack-specific-data. There
> > +is little it can do except issue a diagnostic if \field{ack} is not
> >  VIRTIO_NET_OK.
> >
> >  \paragraph{Packet Receive Filtering}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / Packet Receive Filtering}
> > @@ -4463,6 +4502,534 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
> >  according to the native endian of the guest rather than
> >  (necessarily when not using the legacy interface) little-endian.
> >
> > +\paragraph{RoCE Configuration}\label{sec:Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> > +
> > +If the driver negotiates the VIRTIO_NET_F_ROCE feature bit (depends on VIRTIO_NET_F_CTRL_VQ),
> > +it can send control commands for RoCE usage. The following commands are defined now:
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_NET_CTRL_ROCE    6
> > + #define VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE      0
> > + #define VIRTIO_NET_CTRL_ROCE_QUERY_PORT        1
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_CQ         2
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_CQ        3
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_PD         4
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_PD        5
> > + #define VIRTIO_NET_CTRL_ROCE_GET_DMA_MR        6
> > + #define VIRTIO_NET_CTRL_ROCE_REG_USER_MR       7
> > + #define VIRTIO_NET_CTRL_ROCE_DEREG_MR          8
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_QP         9
> > + #define VIRTIO_NET_CTRL_ROCE_MODIFY_QP         10
> > + #define VIRTIO_NET_CTRL_ROCE_QUERY_QP          11
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_QP        12
> > + #define VIRTIO_NET_CTRL_ROCE_CREATE_AH         13
> > + #define VIRTIO_NET_CTRL_ROCE_DESTROY_AH        14
> > + #define VIRTIO_NET_CTRL_ROCE_ADD_GID           15
> > + #define VIRTIO_NET_CTRL_ROCE_DEL_GID           16
> > + #define VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ     17
> > +\end{lstlisting}
> > +
> > +\begin{description}
> > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_DEVICE] Query the attributes of device.
> > +  No command-specific-data;
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_query_device}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_ack_query_device {
> > +#define VIRTIO_IB_DEVICE_RC_RNR_NAK_GEN    (1 << 0)
>
> What's the meaning of this capability?
>

It indicates whether the device supports RNR-NAK generation for RC
QPs. I will add some comments.

> > +        /* Capabilities mask */
> > +        le64 device_cap_flags;
>
> Will this introduce a migration compatibility issue? E.g src and dst
> have the same features but different capabilities.
>

Should this be controlled by hypervisor since all capabilities is
emulated by software.

> > +        /* Largest contiguous block that can be registered */
> > +        le64 max_mr_size;
> > +        /* Supported memory shift sizes */
> > +        le64 page_size_cap;
> > +        /* Hardware version */
> > +        le32 hw_ver;
>
> What did "hardware version" mean? Is this something that is defined in
> the IB spec?
>

Yes, it's defined in IB spec.

> > +        /* Maximum number of outstanding Work Requests (WR) on Send Queue (SQ) and Receive Queue (RQ) */
> > +        le32 max_qp_wr;
>
> Is this implied in the virtqueue size? If not, why?
>

Yes. Will remove it.

> > +        /* Maximum number of scatter/gather (s/g) elements per WR for SQ for non RDMA Read operations */
> > +        le32 max_send_sge;
> > +        /* Maximum number of s/g elements per WR for RQ for non RDMA Read operations */
> > +        le32 max_recv_sge;
> > +        /* Maximum number of s/g per WR for RDMA Read operations */
> > +        le32 max_sge_rd;
> > +        /* Maximum size of Completion Queue (CQ) */
> > +        le32 max_cqe;
>
> Need to specify the reason why we can't use the virtqueue size for the
> completion queue.
>

I think we can. Will remove it

> > +        /* Maximum number of Memory Regions (MR) */
> > +        le32 max_mr;
> > +        /* Maximum number of Protection Domains (PD) */
> > +        le32 max_pd;
> > +        /* Maximum number of RDMA Read perations that can be outstanding per Queue Pair (QP) */
>
> I guess you mean "operations" here.
>

Yes.

> > +        le32 max_qp_rd_atom;
> > +        /* Maximum depth per QP for initiation of RDMA Read operations */
>
> The member has an "atom" suffix, does it mean "atomic read" or other?
>

It means the atomic operation which is unsupported now. I think we
need to remove it.

> > +        le32 max_qp_init_rd_atom;
> > +        /* Maximum number of Address Handles (AH) */
> > +        le32 max_ah;
> > +        /* Local CA ack delay */
> > +        u8 local_ca_ack_delay;
> > +        /* Padding */
> > +        u8 padding[3];
> > +        /* Reserved for future */
> > +        le32 reserved[14];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_PORT] Query the attributes of port.
> > +  No command-specific-data;
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_query_port}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_ack_query_port {
> > +        /* Length of source Global Identifier (GID) table */
> > +        le32 gid_tbl_len;
> > +        /* Maximum message size */
> > +        le32 max_msg_sz;
>
> I guess this is for both read/write/send/receive? And is 4GB
> sufficient for the future?
>

Now this follows the definition in linux kernel and IB Spec. If we
need to extend it in future, we can add a new field max_msg_sz64?

> > +        /* Reserved for future */
> > +        le32 reserved[6];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_CQ] Create a Completion Queue (CQ).
> > +  The command-specific-data is \field{struct virtio_rdma_cmd_create_cq};
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_create_cq}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_create_cq {
> > +        /* Size of CQ */
> > +        le32 cqe;
> > +};
> > +
> > +struct virtio_rdma_ack_create_cq {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_CQ] Destroy a Completion Queue.
> > +  The command-specific-data is \field{struct virtio_rdma_cmd_destroy_cq};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_cq {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_PD] Create a Protection Domain (PD).
> > +  No command-specific-data;
> > +  the ack-specific-data is \field{struct virtio_rdma_ack_create_pd}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_ack_create_pd {
> > +        /* The handle of PD */
> > +        le32 pdn;
> > +};
> > +\end{lstlisting}
>
> Can this command always succeed? I meant is there a limit of the total
> number of PDs that a single ROCE device can support?
>

Yes, we have max_pd field in structure virtio_rdma_ack_query_device.

> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTORY_PD] Destroy a Protection Domain.
> > +  The command-specific-data is \field{virtio_rdma_cmd_destroy_pd};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_pd {
> > +        /* The handle of PD */
> > +        le32 pdn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_GET_DMA_MR] Get the DMA Memory Region (MR).
> > +  associated with one protection domain.
>
> I wonder what's the difference between VIRTIO_NET_CTRL_ROCE_GET_DMA_MR
> and USR_MR. Can we unify them?
>

We should pass some address for USER_MR. I think we can unify them if we want.

> > +  The command-specific-data is \field{virtio_rdma_cmd_get_dma_mr};
> > +  the ack-specific-data is \field{virtio_rdma_ack_get_dma_mr}.
> > +
> > +\begin{lstlisting}
> > +enum virtio_ib_access_flags {
> > +        VIRTIO_IB_ACCESS_LOCAL_WRITE = (1 << 0),
>
> Is LOCAL_READ implied to work always?
>

Yes, the LOCAL_READ is always supported.

> > +        VIRTIO_IB_ACCESS_REMOTE_WRITE = (1 << 1),
> > +        VIRTIO_IB_ACCESS_REMOTE_READ = (1 << 2),
> > +};
> > +
> > +struct virtio_rdma_cmd_get_dma_mr {
> > +        /* The handle of PD which the MR associated with */
> > +        le32 pdn;
> > +        /* MR's protection attributes, enum virtio_ib_access_flags */
> > +        le32 access_flags;
> > +};
> > +
> > +struct virtio_rdma_ack_get_dma_mr {
> > +        /* The handle of MR */
> > +        le32 mrn;
> > +        /* MR's local access key */
> > +        le32 lkey;
> > +        /* MR's remote access key */
> > +        le32 rkey;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_REG_USER_MR] Register a user Memory Region
> > +  associated with one Protection Domain.
> > +  The command-specific-data is \field{virtio_rdma_cmd_reg_user_mr};
> > +  the ack-specific-data is \field{virtio_rdma_ack_reg_user_mr}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_reg_user_mr {
> > +        /* The handle of PD which the MR associated with */
> > +        le32 pdn;
> > +        /* MR's protection attributes, enum virtio_ib_access_flags */
> > +        le32 access_flags;
> > +        /* Starting virtual address of MR */
> > +        le64 virt_addr;
>
> I guess this is actually the I/O virtual address and the device is in
> charge of translate it to the page arrays below?
>

Yes, this address is specified by userspace, which can be a virtual
address or not.

> > +        /* Length of MR */
> > +        le64 length;
> > +        /* Size of the below page array */
> > +        le32 npages;
> > +        /* Padding */
> > +        le32 padding;
> > +        /* Array to store physical address of each page in MR */
> > +        le64 pages[];
>
> How do device know the size of a page?
>

We have npages field in this struture.

> > +};
>
> I believe this command can fail, we need to describe the error conditions.
>

OK.

> > +
> > +struct virtio_rdma_ack_reg_user_mr {
> > +        /* The handle of MR */
> > +        le32 mrn;
> > +        /* MR's local access key */
> > +        le32 lkey;
> > +        /* MR's remote access key */
> > +        le32 rkey;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DEREG_MR] De-register a Memory Region.
> > +  The command-specific-data is \field{virtio_rdma_cmd_dereg_mr};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_dereg_mr {
> > +        /* The handle of MR */
> > +        le32 mrn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_QP] Create a Queue Pair (Send Queue and Receive Queue).
> > +  The command-specific-data is \field{virtio_rdma_cmd_create_qp};
> > +  the ack-specific-data is \field{virtio_rdma_ack_create_qp}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_qp_cap {
> > +        /* Maximum number of outstanding WRs in SQ */
> > +        le32 max_send_wr;
> > +        /* Maximum number of outstanding WRs in RQ */
> > +        le32 max_recv_wr;
> > +        /* Maximum number of s/g elements per WR in SQ */
> > +        le32 max_send_sge;
> > +        /* Maximum number of s/g elements per WR in RQ */
> > +        le32 max_recv_sge;
> > +        /* Maximum number of data (bytes) that can be posted inline to SQ */
> > +        le32 max_inline_data;
> > +        /* Padding */
> > +        le32 padding;
> > +};
> > +
> > +struct virtio_rdma_cmd_create_qp {
> > +        /* The handle of PD which the QP associated with */
> > +        le32 pdn;
> > +#define VIRTIO_IB_QPT_SMI    0
> > +#define VIRTIO_IB_QPT_GSI    1
> > +#define VIRTIO_IB_QPT_RC     2
> > +#define VIRTIO_IB_QPT_UC     3
> > +#define VIRTIO_IB_QPT_UD     4
> > +        /* QP's type */
> > +        u8 qp_type;
> > +        /* If set, each WR submitted to the SQ generates a completion entry */
> > +        u8 sq_sig_all;
> > +        /* Padding */
> > +        u8 padding[2];
> > +        /* The index of CQ which the SQ associated with */
> > +        le32 send_cqn;
> > +        /* The index of CQ which the RQ associated with */
> > +        le32 recv_cqn;
> > +        /* QP's capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +
> > +struct virtio_rdma_ack_create_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_MODIFY_QP] Modify the attributes of a Queue Pair.
> > +  The command-specific-data is \field{virtio_rdma_cmd_modify_qp};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_global_route {
> > +        /* Destination GID or MGID */
> > +        u8 dgid[16];
> > +        /* Flow label */
> > +        le32 flow_label;
> > +        /* Source GID index */
> > +        u8 sgid_index;
> > +        /* Hop limit */
> > +        u8 hop_limit;
> > +        /* Traffic class */
> > +        u8 traffic_class;
> > +        /* Padding */
> > +        u8 padding;
> > +};
> > +
> > +struct virtio_rdma_ah_attr {
> > +        /* Global Routing Header (GRH) attributes */
> > +        virtio_rdma_global_route grh;
> > +        /* Destination MAC address */
> > +        u8 dmac[6];
> > +        /* Reserved for future */
> > +        u8 reserved[10];
> > +};
> > +
> > +enum virtio_ib_qp_attr_mask {
> > +        VIRTIO_IB_QP_STATE = (1 << 0),
> > +        VIRTIO_IB_QP_CUR_STATE = (1 << 1),
> > +        VIRTIO_IB_QP_ACCESS_FLAGS = (1 << 2),
> > +        VIRTIO_IB_QP_QKEY = (1 << 3),
> > +        VIRTIO_IB_QP_AV = (1 << 4),
> > +        VIRTIO_IB_QP_PATH_MTU = (1 << 5),
> > +        VIRTIO_IB_QP_TIMEOUT = (1 << 6),
> > +        VIRTIO_IB_QP_RETRY_CNT = (1 << 7),
> > +        VIRTIO_IB_QP_RNR_RETRY = (1 << 8),
> > +        VIRTIO_IB_QP_RQ_PSN = (1 << 9),
> > +        VIRTIO_IB_QP_MAX_QP_RD_ATOMIC = (1 << 10),
> > +        VIRTIO_IB_QP_MIN_RNR_TIMER = (1 << 11),
> > +        VIRTIO_IB_QP_SQ_PSN = (1 << 12),
> > +        VIRTIO_IB_QP_MAX_DEST_RD_ATOMIC = (1 << 13),
> > +        VIRTIO_IB_QP_CAP = (1 << 14),
> > +        VIRTIO_IB_QP_DEST_QPN = (1 << 15),
> > +        VIRTIO_IB_QP_RATE_LIMIT = (1 << 16),
> > +};
>
> Do we need to explain the above error codes? Or it's simply a map from IB spec?
>

Yes, it's defined in IB spec. But we can add some comments for them too.

> > +
> > +enum virtio_ib_qp_state {
> > +        VIRTIO_IB_QPS_RESET,
> > +        VIRTIO_IB_QPS_INIT,
> > +        VIRTIO_IB_QPS_RTR,
> > +        VIRTIO_IB_QPS_RTS,
> > +        VIRTIO_IB_QPS_SQD,
> > +        VIRTIO_IB_QPS_SQE,
> > +        VIRTIO_IB_QPS_ERR
> > +};
> > +
> > +enum virtio_ib_mtu {
> > +        VIRTIO_IB_MTU_256 = 1,
> > +        VIRTIO_IB_MTU_512 = 2,
> > +        VIRTIO_IB_MTU_1024 = 3,
> > +        VIRTIO_IB_MTU_2048 = 4,
> > +        VIRTIO_IB_MTU_4096 = 5
> > +};
> > +
> > +struct virtio_rdma_cmd_modify_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +        /* The mask of attributes needs to be modified, enum virtio_ib_qp_attr_mask */
> > +        le32 attr_mask;
> > +        /* Move the QP to this state, enum virtio_ib_qp_state */
> > +        u8 qp_state;
> > +        /* Current QP state, enum virtio_ib_qp_state */
> > +        u8 cur_qp_state;
> > +        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
> > +        u8 path_mtu;
> > +        /* Number of outstanding RDMA Read operations on destination QP (valid only for RC QPs) */
> > +        u8 max_rd_atomic;
> > +        /* Number of responder resources for handling incoming RDMA Read operations (valid only for RC QPs) */
> > +        u8 max_dest_rd_atomic;
> > +        /* Minimum RNR (Receiver Not Ready) NAK timer (valid only for RC QPs) */
> > +        u8 min_rnr_timer;
> > +        /* Local ack timeout (valid only for RC QPs) */
> > +        u8 timeout;
> > +        /* Retry count (valid only for RC QPs) */
> > +        u8 retry_cnt;
> > +        /* RNR retry (valid only for RC QPs) */
> > +        u8 rnr_retry;
> > +        /* Padding */
> > +        u8 padding[7];
> > +        /* Q_Key for the QP (valid only for UD QPs) */
> > +        le32 qkey;
> > +        /* PSN for RQ (valid only for RC/UC QPs) */
> > +        le32 rq_psn;
> > +        /* PSN for SQ */
> > +        le32 sq_psn;
> > +        /* Destination QP number (valid only for RC/UC QPs) */
> > +        le32 dest_qp_num;
> > +        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
> > +        le32 qp_access_flags;
> > +        /* Rate limit in kbps for packet pacing */
> > +        le32 rate_limit;
> > +        /* QP capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Address Vector (valid only for RC/UC QPs) */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_QUERY_QP] Query the attributes of a Queue Pair.
> > +  The command-specific-data is \field{virtio_rdma_cmd_query_qp};
> > +  the ack-specific-data is \field{virtio_rdma_ack_query_qp}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_query_qp {
> > +       /* The index of QP */
> > +        le32 qpn;
> > +        /* The mask of attributes need to be queried, enum virtio_ib_qp_attr_mask */
> > +        le32 attr_mask;
> > +};
> > +
> > +struct virtio_rdma_ack_query_qp {
>
> Any chance to unify this with virtio_rdma_cmd_modify_qp?
>

It would be a little confusing since some states is only used by modify_qp.

> > +        /* Move the QP to this state, enum virtio_ib_qp_state */
> > +        u8 qp_state;
> > +        /* Path MTU (valid only for RC/UC QPs), enum virtio_ib_mtu */
> > +        u8 path_mtu;
> > +        /* Is the SQ draining */
> > +        u8 sq_draining;
> > +        /* Number of outstanding RDMA read operations on destination QP (valid only for RC QPs) */
> > +        u8 max_rd_atomic;
> > +        /* Number of responder resources for handling incoming RDMA read operations (valid only for RC QPs) */
> > +        u8 max_dest_rd_atomic;
> > +        /* Minimum RNR NAK timer (valid only for RC QPs) */
> > +        u8 min_rnr_timer;
> > +        /* Local ack timeout (valid only for RC QPs) */
> > +        u8 timeout;
> > +        /* Retry count (valid only for RC QPs) */
> > +        u8 retry_cnt;
> > +        /* RNR retry (valid only for RC QPs) */
> > +        u8 rnr_retry;
> > +        /* Padding */
> > +        u8 padding[7];
> > +        /* Q_Key for the QP (valid only for UD QPs) */
> > +        le32 qkey;
> > +        /* PSN for RQ (valid only for RC/UC QPs) */
> > +        le32 rq_psn;
> > +        /* PSN for SQ */
> > +        le32 sq_psn;
> > +        /* Destination QP number (valid only for RC/UC QPs) */
> > +        le32 dest_qp_num;
> > +        /* Mask of enabled remote access operations (valid only for RC/UC QPs), enum virtio_ib_access_flags */
> > +        le32 qp_access_flags;
> > +        /* Rate limit in kbps for packet pacing */
> > +        le32 rate_limit;
> > +        /* QP capabilities */
> > +        struct virtio_rdma_qp_cap cap;
> > +        /* Address Vector (valid only for RC/UC QPs) */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +        /* Reserved for future */
> > +        le32 reserved[4];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_QP] Destroy a Queue Pair.
> > +  The command-specific-data is \field{virtio_rdma_cmd_destroy_qp};
> > +  no ack-specific-data.
>
> What happen to the pending requests? Will the device wait for the
> completion or not?
>

It should be discarded according to IB spec.

> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_qp {
> > +        /* The index of QP */
> > +        le32 qpn;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_CREATE_AH] Create a Address Handle (AH).
> > +  The command-specific-data is \field{virtio_rdma_cmd_create_ah};
> > +  the ack-specific-data is \field{virtio_rdma_ack_create_ah}.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_create_ah {
> > +        /* The handle of PD which the AH associated with */
> > +        le32 pdn;
> > +        /* Padding */
> > +        le32 padding;
> > +        /* Address Vector */
> > +        struct virtio_rdma_ah_attr ah_attr;
> > +};
> > +
> > +struct virtio_rdma_ack_create_ah {
> > +        /* The address handle */
> > +        le32 ah;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DESTROY_AH] Destroy a Address Handle.
> > +  The command-specific-data is \field{virtio_rdma_cmd_destroy_ah};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_destroy_ah {
> > +        /* The handle of PD which the AH associated with */
> > +        le32 pdn;
> > +        /* The address handle */
> > +        le32 ah;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_ADD_GID] Add a Global Identifier (GID).
> > +  The command-specific-data is \field{virtio_rdma_cmd_add_gid};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_add_gid {
> > +        /* Index of GID */
> > +        le16 index;
> > +        /* Padding */
> > +        le16 padding[3];
> > +        /* GID to be added */
> > +        u8 gid[16];
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_DEL_GID] Delete a Global Identifier.
> > +  The command-specific-data is \field{virtio_rdma_cmd_del_gid};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_del_gid {
> > +        /* Index of GID */
> > +        le16 index;
> > +};
> > +\end{lstlisting}
> > +
> > +\item[VIRTIO_NET_CTRL_ROCE_REQ_NOTIFY_CQ] Request a completion notification
> > +  on a Completion Queue.
> > +  The command-specific-data is \field{virtio_rdma_cmd_req_notify};
> > +  no ack-specific-data.
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_cmd_req_notify {
> > +        /* The index of CQ */
> > +        le32 cqn;
> > +#define VIRTIO_IB_NOTIFY_SOLICITED (1 << 0)
> > +#define VIRTIO_IB_NOTIFY_NEXT_COMPLETION (1 << 1)
>
> Need to describe the differences on those two flags.
>

OK.

> > +        /* Notify flags */
> > +        le32 flags;
> > +};
> > +\end{lstlisting}
> > +
> > +\end{description}
> > +
> > +\drivernormative{\subparagraph}{RoCE Configuration}{Device Types / Network Device / Device Operation / Control Virtqueue / RoCE Configuration}
> > +
> > +A driver MUST initialize the completion virtqueue and fill it with
> > +enough entries after command VIRTIO_NET_CTRL_ROCE_CREATE_CQ is
> > +successfully executed.
> > +
> > +A driver MUST reset the completion virtqueue after
>
> How to do the reset? Do you mean driver need to reset the indices?
>

Yes, something like avail_idx, used_idx.

> > +command VIRTIO_NET_CTRL_ROCE_DESTROY_CQ is successfully executed.
> > +
> > +A driver MUST initialize the send virtqueue and receive virtqueue after
> > +command VIRTIO_NET_CTRL_ROCE_CREATE_QP is successfully executed.
> > +
> > +A driver MUST reset the send virtqueue and receive virtqueue after
> > +command VIRTIO_NET_CTRL_ROCE_DESTROY_QP is successfully executed.
> >
> >  \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  Types / Network Device / Legacy Interface: Framing Requirements}
> > @@ -4496,6 +5063,289 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
> >  See \ref{sec:Basic
> >  Facilities of a Virtio Device / Virtqueues / Message Framing}.
> >
> > +\subsubsection{RoCE Support}\label{sec:Device Types / Network Device / Device Operation / RoCE Support}
> > +
> > +RDMA over Converged Ethernet (RoCE) is a network protocol that allows
> > +remote direct memory access (RDMA) over an Ethernet network. To support
> > +RoCE (if VIRTIO_NET_F_ROCE is negotiated), in addtion to the control
> > +virtqueue support mentioned in \ref{sec:Device Types / Network Device /
> > +Device Operation / Control Virtqueue / RoCE Configuration}, multiple
> > +types of virtqueues including send virtqueue, receive virtqueue and
> > +completion virtqueue are introduced.
> > +
> > +The send virtqueue contains elements that describe the data to be
> > +transmitted.
> > +
> > +Requests (device-readable) have the following format:
> > +
> > +\begin{lstlisting}
> > +enum virtio_ib_wr_opcode {
> > +        VIRTIO_IB_WR_RDMA_WRITE,
> > +        VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM,
> > +        VIRTIO_IB_WR_SEND,
> > +        VIRTIO_IB_WR_SEND_WITH_IMM,
> > +        VIRTIO_IB_WR_RDMA_READ,
> > +};
> > +
> > +struct virtio_rdma_sge {
> > +        le64 addr;
> > +        le32 length;
> > +        le32 lkey;
> > +};
> > +
> > +struct virtio_rdma_sq_req {
> > +        /* User defined WR ID */
> > +        le64 wr_id;
> > +        /* WR opcode, enum virtio_ib_wr_opcode */
> > +        u8 opcode;
> > +#define VIRTIO_IB_SEND_FENCE        (1 << 0)
> > +#define VIRTIO_IB_SEND_SIGNALED     (1 << 1)
> > +#define VIRTIO_IB_SEND_SOLICITED    (1 << 2)
> > +#define VIRTIO_IB_SEND_INLINE       (1 << 3)
> > +        /* Flags of the WR properties */
> > +        u8 send_flags;
> > +        /* Padding */
> > +        le16 padding;
> > +        /* Immediate data (in network byte order) to send */
> > +        le32 imm_data;
> > +        union {
> > +                struct {
> > +                        /* Start address of remote memory buffer */
> > +                        le64 remote_addr;
> > +                        /* Key of the remote MR */
> > +                        le32 rkey;
> > +                } rdma;
> > +                struct {
> > +                        /* Index of the destination QP */
> > +                        le32 remote_qpn;
> > +                        /* Q_Key of the destination QP */
> > +                        le32 remote_qkey;
> > +                        /* Address Handle */
> > +                        le32 ah;
> > +                } ud;
> > +                /* Reserved for future */
> > +                le64 reserved[4];
> > +        };
> > +        /* Inline data */
> > +        u8 inline_data[512];
> > +        union {
> > +                /* Length of sg_list */
> > +                le32 num_sge;
> > +                /* Length of inline data */
> > +                le16 inline_len;
> > +        };
> > +        /* Reserved for future */
> > +        le32 reserved2[3];
> > +       /* Scatter/gather list */
> > +        struct virtio_rdma_sge sg_list[];
> > +};
> > +\end{lstlisting}
> > +
> > +The receive virtqueue contains elements that describe where to place incoming data.
> > +
> > +Requests (device-readable) have the following format:
> > +
> > +\begin{lstlisting}
> > +struct virtio_rdma_rq_req {
> > +        /* User defined WR ID */
> > +        le64 wr_id;
> > +        /* Length of sg_list */
> > +        le32 num_sge;
> > +        /* Reserved for future */
> > +        le32 reserved[3];
> > +        /* Scatter/gather list */
> > +        struct virtio_rdma_sge sg_list[];
> > +};
> > +\end{lstlisting}
> > +
> > +The completion virtqueue is used to notify the completion of requests in
> > +send virtqueue or receive virtqueue.
> > +
> > +Requests (device-writable) have the following format:
> > +
> > +\begin{lstlisting}
> > +enum virtio_ib_wc_opcode {
> > +        VIRTIO_IB_WC_SEND,
> > +        VIRTIO_IB_WC_RDMA_WRITE,
> > +        VIRTIO_IB_WC_RDMA_READ,
> > +        VIRTIO_IB_WC_RECV,
> > +        VIRTIO_IB_WC_RECV_RDMA_WITH_IMM,
> > +};
> > +
> > +enum virtio_ib_wc_status {
> > +        /* Operation completed successfully */
> > +        VIRTIO_IB_WC_SUCCESS,
> > +        /* Local Length Error */
> > +        VIRTIO_IB_WC_LOC_LEN_ERR,
> > +        /* Local QP Operation Error */
> > +        VIRTIO_IB_WC_LOC_QP_OP_ERR,
> > +        /* Local Protection Error */
> > +        VIRTIO_IB_WC_LOC_PROT_ERR,
> > +        /* Work Request Flushed Error */
> > +        VIRTIO_IB_WC_WR_FLUSH_ERR,
> > +        /* Bad Response Error */
> > +        VIRTIO_IB_WC_BAD_RESP_ERR,
> > +        /* Local Access Error */
> > +        VIRTIO_IB_WC_LOC_ACCESS_ERR,
> > +        /* Remote Invalid Request Error */
> > +        VIRTIO_IB_WC_REM_INV_REQ_ERR,
> > +        /* Remote Access Error */
> > +        VIRTIO_IB_WC_REM_ACCESS_ERR,
> > +        /* Remote Operation Error */
> > +        VIRTIO_IB_WC_REM_OP_ERR,
> > +        /* Transport Retry Counter Exceeded */
> > +        VIRTIO_IB_WC_RETRY_EXC_ERR,
> > +        /* RNR Retry Counter Exceeded */
> > +        VIRTIO_IB_WC_RNR_RETRY_EXC_ERR,
> > +        /* Remote Aborted Error */
> > +        VIRTIO_IB_WC_REM_ABORT_ERR,
> > +        /* Fatal Error */
> > +        VIRTIO_IB_WC_FATAL_ERR,
> > +        /* Response Timeout Error */
> > +        VIRTIO_IB_WC_RESP_TIMEOUT_ERR,
> > +        /* General Error */
> > +        VIRTIO_IB_WC_GENERAL_ERR
> > +};
> > +
> > +struct virtio_rdma_cq_req {
> > +        /* User defined WR ID */
> > +        le64 wr_id;
> > +        /* Work completion status, enum virtio_ib_wc_status */
> > +        u8 status;
> > +        /* WR opcode, enum virtio_ib_wc_opcode */
> > +        u8 opcode;
> > +        /* Padding */
> > +        le16 padding;
> > +        /* Vendor error */
> > +        le32 vendor_err;
> > +        /* Number of bytes transferred */
> > +        le32 byte_len;
> > +        /* Immediate data (in network byte order) to send */
> > +        le32 imm_data;
> > +        /* Local QP number of completed WR */
> > +        le32 qp_num;
> > +        /* Source QP number (remote QP number) of completed WR (valid only for UD QPs) */
> > +        le32 src_qp;
> > +#define VIRTIO_IB_WC_GRH         (1 << 0)
> > +#define VIRTIO_IB_WC_WITH_IMM    (1 << 1)
> > +        /* Work completion flag */
> > +        le32 wc_flags;
> > +        /* Reserved for future */
> > +        le32 reserved[3];
> > +};
> > +\end{lstlisting}
> > +
> > +\paragraph{Send Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +The send operation allows us to send data to a remote QP’s Receive Queue.
> > +The receiver MUST have previously posted a receive buffer to receive the data.
>
> "MUST" keyword must belong to the normative section.
>

OK.

> > +
> > +To do a send operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_SEND or VIRTIO_IB_WR_SEND_WITH_IMM MUST be posted to the Send
> > +Queue as one output descriptor and the device is notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +If VIRTIO_IB_SEND_INLINE is set in \field{send_flags}, the driver MUST fill
> > +send buffer into \field{inline_data} field and set \field{inline_len} to the
> > +length of the buffer. Otherwise, the driver MUST fill \field{sg_list} to
> > +describe the buffer.
> > +
> > +\devicenormative{\subparagraph}{Send Operation}{Device Types / Network Device / Device Operation / RoCE Support / Send Operation}
> > +
> > +If \field{opcode} is not set to VIRTIO_IB_WR_SEND_WITH_IMM, the device MUST
> > +ignore \field{imm_data}.
> > +
> > +If the QP type is UD, the device MUST validate \field{ud.ah}.
> > +
> > +If VIRTIO_IB_SEND_INLINE is not set in \field{send_flags}, the device MUST
> > +validate the \field{addr}, \field{length} and \field{lkey} in \field{sg_list}.
> > +
> > +\paragraph{Receive Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The receive operation allows us to receive data from remote QP.
> > +It's the corresponding operation to a send operation.
> > +
> > +To do a receive operation, a request MUST be posted to the Receive
> > +Queue as one output descriptor and the device is notified of the new entry.
> > +
>
> I think we probably need to be more verbose as what has been done for
> virtio-net.
>
> That is, describe what need to be filled in virtio_rdma_rq_req in
> details. (And do this for other operation as well)
>

OK.

>
> > +\drivernormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the receive buffer.
> > +
> > +\devicenormative{\subparagraph}{Receive Operation}{Device Types / Network Device / Device Operation / RoCE Support / Receive Operation}
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Write Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +The write operation allows us to write data to the local memory buffer
> > +in remote side with no notification. The remote side wouldn't be aware
> > +that this operation being done.
> > +
> > +To do a write operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_RDMA_WRITE or VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM MUST be
> > +posted to the Send Queue as one output descriptor and the device is
> > +notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the write buffer.
>
> So sg is a must even if the driver want to use imm?
>

Looks like not. I will fix it.

> > +
> > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> > +identify the remote buffer.
> > +
> > +\devicenormative{\subparagraph}{Write Operation}{Device Types / Network Device / Device Operation / RoCE Support / Write Operation}
> > +
> > +If \field{opcode} is not set to VIRTIO_IB_WR_RDMA_WRITE_WITH_IMM, the device
> > +MUST ignore \field{imm_data}.
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Read Operation}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The read operation allows us to read data from the local memory buffer
> > +in remote side with no notification. The remote side wouldn't be aware
> > +that this operation being done.
> > +
> > +To do a read operation, a request with \field{opcode} set to
> > +VIRTIO_IB_WR_RDMA_READ MUST be posted to the Send Queue as one output
> > +descriptor and the device is notified of the new entry.
> > +
> > +\drivernormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The driver MUST fill \field{sg_list} to describe the read buffer.
> > +
> > +The driver MUST fill \field{rdma.remote_addr} and \field{rdma.rkey} to
> > +identify the remote buffer.
> > +
> > +\devicenormative{\subparagraph}{Read Operation}{Device Types / Network Device / Device Operation / RoCE Support / Read Operation}
> > +
> > +The device MUST validate the \field{addr}, \field{length} and \field{lkey}
> > +in \field{sg_list}.
> > +
> > +\paragraph{Completion Notification}\label{sec:Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> > +
> > +After above operation is completed, a completion notification MUST
> > +be triggered by the device.
>
> For "completion notification", do you mean the virtqueue notification
> of cq or the making the buffer than contains cqe used?
>

Both? Making the buffer that contains cqe used and notify the virtqueue.

> > To achieve that, the device MUST consume
> > +an entry of the Completion Queue associated with the Send Queue/Receive
> > +Queue which the operation belongs to.
> > +
> > +\drivernormative{\subparagraph}{Completion Notification}{Device Types / Network Device / Device Operation / RoCE Support / Completion Notification}
> > +
> > +The driver MUST fill the Completion Queue with enough entries previously.
>
> What do you mean by "previously"? What happens if there's no sufficient cqe?
>

We need to fill the Completion Queue in advance. Otherwise, the driver
would not get completion notification after some operation is
completed.

Thanks,
Yongji




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux