On Fri, Mar 19, 2021 at 08:56:27AM -0400, kaike.wan@xxxxxxxxx wrote: > From: Kaike Wan <kaike.wan@xxxxxxxxx> > > The RDMA Rendezvous (rv) module provides an interface for HPC > middlewares to improve performance by caching memory region > registration, and improve the scalibity of RDMA transaction > through connection managements between nodes. This mechanism > is implemented through the following ioctl requests: > - ATTACH: to attach to an RDMA device. > - REG_MEM: to register a user/kernel memory region. > - DEREG_MEM: to release application use of MR, allowing it to > remain in cache. > - GET_CACHE_STATS: to get cache statistics. > - CONN_CREATE: to create an RC connection. > - CONN_CONNECT: to start the connection. > - CONN_GET_CONN_COUNT: to use as part of error recovery from lost > messages in the application. > - CONN_GET_STATS: to get connection statistics. > - GET_EVENT_STATS: to get the RDMA event statistics. > - POST_RDMA_WR_IMMED: to post an RDMA WRITE WITH IMMED request. > > Signed-off-by: Todd Rimmer <todd.rimmer@xxxxxxxxx> > Signed-off-by: Kaike Wan <kaike.wan@xxxxxxxxx> > include/uapi/rdma/rv_user_ioctls.h | 558 +++++++++++++++++++++++++++++ > 1 file changed, 558 insertions(+) > create mode 100644 include/uapi/rdma/rv_user_ioctls.h > > diff --git a/include/uapi/rdma/rv_user_ioctls.h b/include/uapi/rdma/rv_user_ioctls.h > new file mode 100644 > index 000000000000..97e35b722443 > +++ b/include/uapi/rdma/rv_user_ioctls.h > @@ -0,0 +1,558 @@ > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ > +/* > + * Copyright(c) 2020 - 2021 Intel Corporation. > + */ > +#ifndef __RV_USER_IOCTL_H__ > +#define __RV_USER_IOCTL_H__ > +#include <rdma/rdma_user_ioctl.h> > +#include <rdma/ib_user_sa.h> > +#include <rdma/ib_user_verbs.h> > + > +/* Checking /Documentation/userspace-api/ioctl/ioctl-number.rst */ > +#define RV_MAGIC RDMA_IOCTL_MAGIC No. > +#define RV_FILE_NAME "/dev/rv" No. > + > +/* > + * Handles are opaque to application; they are meaningful only to the > + * RV driver > + */ > + > +/* this version of ABI */ > +#define RV_ABI_VER_MAJOR 1 > +#define RV_ABI_VER_MINOR 0 No, see my remarks to your other intel colleagues doing the ioasid stuff. > +struct rv_query_params_out { > + /* ABI version */ > + __u16 major_rev; > + __u16 minor_rev; > + __u32 resv1; > + __aligned_u64 capability; > + __aligned_u64 resv2[6]; No pre-adding reserved stuff > +}; > + > +#define RV_IOCTL_QUERY _IOR(RV_MAGIC, 0xFC, struct rv_query_params_out) > + > +/* Mode for use of rv module by PSM */ > +#define RV_RDMA_MODE_USER 0 /* user MR caching only */ > +#define RV_RDMA_MODE_KERNEL 1 /* + kernel RC QPs with kernel MR caching */ Huh? Mode sounds bad. > +/* > + * mr_cache_size is in MBs and if 0 will use module param as default > + * num_conn - number of QPs between each pair of nodes > + * loc_addr - used to select client/listen vs rem_addr > + * index_bits - num high bits of immed data with rv index > + * loc_gid_index - SGID for client connections > + * loc_gid[16] - to double check gid_index unchanged > + * job_key[RV_MAX_JOB_KEY_LEN] = unique uuid per job > + * job_key_len - len, if 0 matches jobs with len==0 only > + * q_depth - size of QP and per QP CQs > + * reconnect_timeout - in seconds from loss to restoration > + * hb_interval - in milliseconds between heartbeats > + */ > +struct rv_attach_params_in { > + char dev_name[RV_MAX_DEV_NAME_LEN]; Guessing this is a no too. > + __u32 mr_cache_size; > + __u8 rdma_mode; > + > + /* additional information for RV_RDMA_MODE_KERNEL */ > + __u8 port_num; > + __u8 num_conn; Lots of alignment holes, don't do that either. > +struct rv_attach_params { > + union { > + struct rv_attach_params_in in; > + struct rv_attach_params_out out; > + }; > +}; Yikes, no > +/* The buffer is used to register a kernel mr */ > +#define IBV_ACCESS_KERNEL 0x80000000 Huh? WTF on on many levels > +/* > + * ibv_pd_handle - user space appl allocated pd > + * ulen - driver_udata inlen > + * *udata - driver_updata inbuf > + */ > +struct rv_mem_params_in { > + __u32 ibv_pd_handle; > + __u32 cmd_fd_int; Really? You want to reach into the command FD from a ULP and extract objects? Double yikes. Did you do this properly, taking care of every lifetime rule and handling disassociation? > + __aligned_u64 addr; > + __aligned_u64 length; > + __u32 access; > + size_t ulen; > + void *udata; 'void *' is wrong for ioctls. > +struct rv_conn_get_stats_params_out { Too many stats, don't you think? Most of the header seems to be stats of one thing or another. > +/* > + * events placed on ring buffer for delivery to user space. > + * Carefully sized to be a multiple of 64 bytes for cache alignment. > + * Must pack to get good field alignment and desired 64B overall size > + * Unlike verbs, all rv_event fields are defined even when > + * rv_event.wc.status != IB_WC_SUCCESS. Only sent writes can report bad status. > + * event_type - enum rv_event_type > + * wc - send or recv work completions > + * status - ib_wc_status > + * resv1 - alignment > + * imm_data - for RV_WC_RECV_RDMA_WITH_IMM only > + * wr_id - PSM wr_id for RV_WC_RDMA_WRITE only > + * conn_handle - conn handle. For efficiency in completion processing, this > + * handle is the rv_conn handle, not the rv_user_conn. > + * Main use is sanity checks. On Recv PSM must use imm_data to > + * efficiently identify source. > + * byte_len - unlike verbs API, this is always valid > + * resv2 - alignment > + * cache_align - not used, but forces overall struct to 64B size > + */ > +struct rv_event { > + __u8 event_type; > + union { > + struct { > + __u8 status; > + __u16 resv1; > + __u32 imm_data; > + __aligned_u64 wr_id; > + __aligned_u64 conn_handle; > + __u32 byte_len; > + __u32 resv2; > + } __attribute__((__packed__)) wc; > + struct { > + __u8 pad[7]; > + uint64_t pad2[7]; > + } __attribute__((__packed__)) cache_align; > + }; > +} __attribute__((__packed__)); Uhh, mixing packed and aligned_u64 is pretty silly. I don't think this needs to be packed or written in this tortured way. > + > +/* > + * head - consumer removes here > + * tail - producer adds here > + * overflow_cnt - number of times producer overflowed ring and discarded > + * pad - 64B cache alignment for entries > + */ > +struct rv_ring_header { > + volatile __u32 head; > + volatile __u32 tail; > + volatile __u64 overflow_cnt; No on volatile, and missed a __aligned here This uapi needs to be much better. It looks like the mess from the PSM char dev just re-imported here. At the very least split the caching from the other operations and follow normal ioctl design And you need to rethink the uverbs stuff. Jason