README with description of major sysfs entries. The sysfs documentation has
been moved to the ABI dir as suggested by Bart.

Signed-off-by: Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
---
 .../ABI/testing/sysfs-class-rtrs-client       | 131 +++++++++++
 .../ABI/testing/sysfs-class-rtrs-server       |  53 +++++
 drivers/infiniband/ulp/rtrs/README            | 213 ++++++++++++++++++
 3 files changed, 397 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-rtrs-client
 create mode 100644 Documentation/ABI/testing/sysfs-class-rtrs-server
 create mode 100644 drivers/infiniband/ulp/rtrs/README

diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-client b/Documentation/ABI/testing/sysfs-class-rtrs-client
new file mode 100644
index 000000000000..e7e718db8941
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-rtrs-client
@@ -0,0 +1,131 @@
+What:		/sys/class/rtrs-client
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	When a user of the RTRS API creates a new session, a directory entry with
+		the name of that session is created under /sys/class/rtrs-client/<session-name>/
+
+What:		/sys/class/rtrs-client/<session-name>/add_path
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RW, adds a new path (connection) to an existing session. Expected format is the
+		following:
+
+		<[source addr,]destination addr>
+		*addr ::= [ ip:<ipv4|ipv6> | gid:<gid> ]
+
+What:		/sys/class/rtrs-client/<session-name>/max_reconnect_attempts
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Maximum number of reconnect attempts the client should make before giving up
+		after the connection breaks unexpectedly.
+
+What:		/sys/class/rtrs-client/<session-name>/mp_policy
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Multipath policy specifies which path should be selected on each IO:
+
+		round-robin (0):
+		    select path in per CPU round-robin manner.
+
+		min-inflight (1):
+		    select path with minimum inflights.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Each path belonging to a given session is listed here by its source and
+		destination address. When a new path is added to a session by writing to
+		the "add_path" entry, a directory <src@dst> is created.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/state
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains "connected" if the session is connected to the peer and fully
+		functional. Otherwise the file contains "disconnected"
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/reconnect
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Write "1" to the file in order to reconnect the path.
+		Operation is blocking and returns 0 if reconnect was successful.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/disconnect
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Write "1" to the file in order to disconnect the path.
+		Operation blocks until RTRS path is disconnected.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/remove_path
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Write "1" to the file in order to disconnect and remove the path
+		from the session. Operation blocks until the path is disconnected
+		and removed from the session.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_name
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the name of the HCA the connection is established on.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_port
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the port number of the active port the traffic is going through.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/src_addr
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the source address of the path.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/dst_addr
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the destination address of the path.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/reset_all
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RW, Reading returns usage help, writing 0 clears all the statistics.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/cpu_migration
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RTRS expects that each HCA IRQ is pinned to a separate CPU. If that is
+		not the case, an I/O response could be processed on a different CPU
+		than the one where it was originally submitted. This file shows
+		how many interrupts were generated on an unexpected CPU.
+		"from:" is the CPU on which the IRQ was expected, but not generated.
+		"to:" is the CPU on which the IRQ was generated, but not expected.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/reconnects
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Contains 2 unsigned int values: the first one records the number of successful
+		reconnects in the path lifetime, the second one records the number of failed
+		reconnects in the path lifetime.
+
+What:		/sys/class/rtrs-client/<session-name>/paths/<src@dst>/stats/rdma
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Contains statistics regarding rdma operations and inflight operations.
+		The output consists of 6 values:
+
+		<read-count> <read-total-size> <write-count> <write-total-size> \
+		<inflights> <failovered>
diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-server b/Documentation/ABI/testing/sysfs-class-rtrs-server
new file mode 100644
index 000000000000..3b6d5b067df0
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-rtrs-server
@@ -0,0 +1,53 @@
+What:		/sys/class/rtrs-server
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	When a user of the RTRS API creates a new session on the client side, a
+		directory entry with the name of that session is created in here.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	When a new path is created by writing to the "add_path" entry on the client side,
+		a directory entry named <source address>@<destination address> is created
+		on the server.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/disconnect
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	When "1" is written to the file, the RTRS session is disconnected.
+		The operation is non-blocking and returns control immediately to the caller.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_name
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the name of the HCA the connection is established on.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_port
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the port number of the active port the traffic is going through.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/src_addr
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the source address of the path.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/dst_addr
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	RO, Contains the destination address of the path.
+
+What:		/sys/class/rtrs-server/<session-name>/paths/<src@dst>/stats/rdma
+Date:		Feb 2020
+KernelVersion:	5.7
+Contact:	Jack Wang <jinpu.wang@xxxxxxxxxxxxxxx> Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxx>
+Description:	Contains statistics regarding rdma operations and inflight operations.
+		The output consists of 5 values:
+		<read-count> <read-total-size> <write-count> <write-total-size> <inflights>
diff --git a/drivers/infiniband/ulp/rtrs/README b/drivers/infiniband/ulp/rtrs/README
new file mode 100644
index 000000000000..5d9ea142e5dd
--- /dev/null
+++ b/drivers/infiniband/ulp/rtrs/README
@@ -0,0 +1,213 @@
+****************************
+RDMA Transport (RTRS)
+****************************
+
+RTRS (RDMA Transport) is a reliable high speed transport library
+which provides support to establish an optimal number of connections
+between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
+transport. It is optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+RTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries in
+Documentation/ABI/testing/sysfs-class-rtrs-client).
+
+RTRS is used by the RNBD (RDMA Network Block Device) modules.
+
+==================
+Transport protocol
+==================
+
+Overview
+--------
+An established connection between a client and a server is called an rtrs
+session. A session is associated with a set of memory chunks reserved on the
+server side for a given client for rdma transfer. A session
+consists of multiple paths, each representing a separate physical link
+between client and server. Those are used for load balancing and failover.
+Each path consists of as many connections (QPs) as there are cpus on
+the client.
+
+When processing an incoming write or read request, the rtrs client uses memory
+chunks reserved for it on the server side. Their number, size and addresses
+need to be exchanged between client and server during the connection
+establishment phase. Apart from the memory related information, the client needs
+to inform the server about the session name and identify each path and
+connection individually.
+
+On an established session the client sends write or read messages to the server.
+The server uses the immediate field to tell the client which request is being
+acknowledged and for errno. The client uses the immediate field to tell the
+server which of the memory chunks has been accessed and at which offset the
+message can be found.
+
+Module parameter always_invalidate is introduced for the security problem
+discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we
+invalidate each rdma buffer before we hand it over to RNBD server and
+then pass it to the block layer. A new rkey is generated and registered for the
+buffer after it returns back from the block layer and RNBD server.
+The new rkey is sent back to the client along with the IO result.
+The procedure is the default behaviour of the driver. This invalidation and
+registration on each IO causes a performance drop of up to 20%.
A user of the
+driver may choose to load the modules with this mechanism switched off
+(always_invalidate=N), if they understand and can take the risk of a malicious
+client being able to corrupt memory of a server it is connected to. This might
+be a reasonable option in a scenario where all the clients and all the servers
+are located within a secure datacenter.
+
+
+Connection establishment
+------------------------
+
+1. Client starts establishing connections belonging to a path of a session one
+by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
+Those include uuid of the session and uuid of the path to be
+established. They are used by the server to find a persisting session/path or
+to create a new one when necessary. The message also contains the protocol
+version and magic for compatibility, total number of connections per session
+(as many as cpus on the client), the id of the current connection and
+the reconnect counter, which is used to resolve the situations where
+client is trying to reconnect a path, while server is still destroying the old
+one.
+
+2. Server accepts the connection requests one by one and attaches
+RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from magic and
+protocol version, the messages include error code, queue depth supported by
+the server (number of memory chunks which are going to be allocated for that
+session) and the maximum size of one io. The RTRS_MSG_NEW_RKEY_F flag is set
+when always_invalidate=Y.
+
+3. After all connections of a path are established client sends to server the
+RTRS_MSG_INFO_REQ message, containing the name of the session. This message
+requests the address information from the server.
+
+4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
+which contains the addresses and keys of the RDMA buffers allocated for that
+session.
+
+5. Session becomes connected after all paths to be established are connected
+(i.e.
steps 1-4 finished for all paths requested for a session).
+
+6. Server and client periodically exchange heartbeat messages (empty rdma
+messages with an immediate field) which are used to detect a crash on the remote
+side or a network outage in the absence of IO.
+
+7. On any RDMA related error or in the case of a heartbeat timeout, the
+corresponding path is disconnected, all the inflight IO are failed over to a
+healthy path, if any, and the reconnect mechanism is triggered.
+
+CLT                                     SRV
+*for each connection belonging to a path and for each path:
+RTRS_MSG_CON_REQ  ------------------->
+                  <------------------- RTRS_MSG_CON_RSP
+...
+*after all connections are established:
+RTRS_MSG_INFO_REQ ------------------->
+                  <------------------- RTRS_MSG_INFO_RSP
+*heartbeat is started from both sides:
+                  -------------------> [RTRS_HB_MSG_IMM]
+[RTRS_HB_MSG_ACK] <-------------------
+[RTRS_HB_MSG_IMM] <-------------------
+                  -------------------> [RTRS_HB_MSG_ACK]
+
+IO path
+-------
+
+* Write (always_invalidate=N) *
+
+1. When processing a write request client selects one of the memory chunks
+on the server side and rdma writes there the user data, user header and the
+RTRS_MSG_RDMA_WRITE message.
Apart from the type (write), the message only
+contains size of the user header. The client tells the server which chunk has
+been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
+using the IMM field.
+
+2. When confirming a write request server sends an "empty" rdma message with
+an immediate field. The 32 bit field is used to specify the outstanding
+inflight IO and for the error code.
+
+CLT                                                          SRV
+usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
+[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
+
+* Write (always_invalidate=Y) *
+
+1. When processing a write request client selects one of the memory chunks
+on the server side and rdma writes there the user data, user header and the
+RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
+contains size of the user header. The client tells the server which chunk has
+been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
+using the IMM field. The server first invalidates the rkey associated with the
+memory chunks and, once that finishes, passes the IO to the RNBD server module.
+
+2. When confirming a write request server sends an "empty" rdma message with
+an immediate field. The 32 bit field is used to specify the outstanding
+inflight IO and for the error code. The new rkey is sent back using a
+SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
+the message, updates the rkey for the rbuffer, finishes the IO and then posts
+back the recv buffer for later use.
+
+CLT                                                          SRV
+usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
+[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
+[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)
+
+
+* Read (always_invalidate=N)*
+
+1. When processing a read request client selects one of the memory chunks
+on the server side and rdma writes there the user header and the
+RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
+the user header, flags (specifying if memory invalidation is necessary) and the
+list of addresses along with keys for the data to be read into.
+
+2. When confirming a read request server transfers the requested data first,
+attaches an invalidation message if requested and finally an "empty" rdma
+message with an immediate field. The 32 bit field is used to specify the
+outstanding inflight IO and the error code.
+
+CLT                                           SRV
+usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
+[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
+or in case client requested invalidation:
+[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
+
+* Read (always_invalidate=Y)*
+
+1.
When processing a read request client selects one of the memory chunks
+on the server side and rdma writes there the user header and the
+RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
+the user header, flags (specifying if memory invalidation is necessary) and the
+list of addresses along with keys for the data to be read into.
+The server first invalidates the rkey associated with the memory chunks and,
+once that finishes, passes the IO to the RNBD server module.
+
+2. When confirming a read request server transfers the requested data first,
+attaches an invalidation message if requested and finally an "empty" rdma
+message with an immediate field. The 32 bit field is used to specify the
+outstanding inflight IO and the error code. The new rkey is sent back using a
+SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
+the message, updates the rkey for the rbuffer, finishes the IO and then posts
+back the recv buffer for later use.
+
+CLT                                           SRV
+usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
+[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
+[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
+or in case client requested invalidation:
+[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
+=========================================
+Contributors List (in alphabetical order)
+=========================================
+Danil Kipnis <danil.kipnis@xxxxxxxxxxxxxxxx>
+Fabian Holler <mail@xxxxxxxxxx>
+Guoqing Jiang <guoqing.jiang@xxxxxxxxxxxxxxx>
+Jack Wang <jinpu.wang@xxxxxxxxxxxxxxxx>
+Kleber Souza <kleber.souza@xxxxxxxxxxxxxxxx>
+Lutz Pogrell <lutz.pogrell@xxxxxxxxxxxxxxx>
+Milind Dumbare <Milind.dumbare@xxxxxxxxx>
+Roman Penyaev <roman.penyaev@xxxxxxxxxxxxxxxx>
-- 
2.17.1
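
As a rough illustration of the client sysfs formats documented in
sysfs-class-rtrs-client, the Python sketch below builds a string in the
"add_path" format and parses the six values of the client "stats/rdma" entry.
It is not part of the patch. The helper names and the sample addresses are
made up, and the parser assumes the six values are whitespace-separated
decimal integers, which the ABI text does not state explicitly.

```python
from typing import NamedTuple, Optional


def format_add_path(dst: str, src: Optional[str] = None, kind: str = "ip") -> str:
    """Build '<[source addr,]destination addr>' with 'ip:' or 'gid:' prefixes.

    Hypothetical helper: it only formats the string, it does not validate
    that the addresses themselves are well-formed.
    """
    if kind not in ("ip", "gid"):
        raise ValueError("address kind must be 'ip' or 'gid'")
    # The source address is optional; the destination is always present.
    addrs = [a for a in (src, dst) if a is not None]
    return ",".join(f"{kind}:{a}" for a in addrs)


class ClientRdmaStats(NamedTuple):
    """Field order follows the client stats/rdma description."""
    read_count: int
    read_total_size: int
    write_count: int
    write_total_size: int
    inflights: int
    failovered: int


def parse_client_rdma_stats(text: str) -> ClientRdmaStats:
    """Parse the content of stats/rdma (assumed: six decimal integers)."""
    fields = text.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 values, got {len(fields)}")
    return ClientRdmaStats(*(int(f) for f in fields))


if __name__ == "__main__":
    # Sample values only; on a real system the string would be written to
    # /sys/class/rtrs-client/<session-name>/add_path and the stats text
    # would be read from .../paths/<src@dst>/stats/rdma.
    print(format_add_path("192.168.122.1", src="192.168.122.2"))
    print(parse_client_rdma_stats("10 40960 5 20480 0 1"))
```

The server side "stats/rdma" entry lists only five values (no <failovered>),
so a parser for it would need a separate, shorter tuple.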
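
The two "mp_policy" values can be pictured with a small user-space model. The
sketch below is only an interpretation of the one-line descriptions in the ABI
file ("per CPU round-robin" and "minimum inflights"); the class and method
names are hypothetical and it does not reflect how the kernel module actually
implements path selection.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Path:
    name: str
    inflight: int = 0  # number of IOs currently in flight on this path


@dataclass
class Session:
    paths: List[Path]
    # One independent round-robin cursor per CPU (assumed reading of
    # "per CPU round-robin manner").
    rr_cursor: Dict[int, int] = field(default_factory=dict)

    def pick_round_robin(self, cpu: int) -> Path:
        """round-robin (0): each CPU cycles through the paths on its own."""
        idx = self.rr_cursor.get(cpu, 0) % len(self.paths)
        self.rr_cursor[cpu] = idx + 1
        return self.paths[idx]

    def pick_min_inflight(self) -> Path:
        """min-inflight (1): choose the path with the fewest inflight IOs."""
        return min(self.paths, key=lambda p: p.inflight)


if __name__ == "__main__":
    sess = Session([Path("path-a"), Path("path-b")])
    print([sess.pick_round_robin(cpu=0).name for _ in range(3)])
    sess.paths[0].inflight = 3
    sess.paths[1].inflight = 1
    print(sess.pick_min_inflight().name)
```

Keeping a cursor per CPU means two CPUs submitting IO at the same time do not
have to share one counter, at the cost of the overall distribution being only
approximately even.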