On 2016/07/06 17:29, Ursula Braun wrote:
> Dave,
>
> we would still like to see SMC-R included in a future Linux kernel. After
> we answered your first two questions, there has been no further response.
> What should we do next?
> - Still wait for an answer from you?
> - Resend the same whole SMC-R patch series, this time with the cover
>   letter adapted to your requested changes?

^^^ I would suggest sending v2 of the patch series with the changes that
were requested.

> - Put the SMC-R development on hold, and concentrate first on another
>   s390-specific SMC solution (not RDMA-based) that makes use of the
>   SMC socket family as well.
> - Anything else?
>
> Kind regards, Ursula
>
> -------- Forwarded Message --------
> Subject: Fwd: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Tue, 21 Jun 2016 16:02:59 +0200
> From: Ursula Braun <ubraun@xxxxxxxxxxxxxxxxxx>
> To: davem@xxxxxxxxxxxxx
> CC: netdev@xxxxxxxxxxxxxxx, linux-s390@xxxxxxxxxxxxxxx,
> schwidefsky@xxxxxxxxxx, heiko.carstens@xxxxxxxxxx, utz.bacher@xxxxxxxxxx
>
> Dave,
>
> the SMC-R patches submitted on 2016-06-03 show up in state "Changes
> Requested" on patchwork:
> https://patchwork.ozlabs.org/project/netdev/list/?submitter=2266&state=*&page=1
>
> You had requested a change of the SMC-R description in the cover letter.
> We came up with the response below. Do you need anything else from us?
>
> Kind regards,
> Ursula Braun
>
> -------- Forwarded Message --------
> Subject: Re: [PATCH net-next 00/15] net/smc: Shared Memory
> Communications - RDMA
> Date: Thu, 9 Jun 2016 17:36:28 +0200
> From: Ursula Braun <ubraun@xxxxxxxxxxxxxxxxxx>
> To: davem@xxxxxxxxxxxxx
> CC: netdev@xxxxxxxxxxxxxxx, linux-s390@xxxxxxxxxxxxxxx,
> schwidefsky@xxxxxxxxxx, heiko.carstens@xxxxxxxxxx
>
> On Tue, 2016-06-07 at 15:07 -0700, David Miller wrote:
> > In case my previous reply wasn't clear enough, I require that you
> > provide a more accurate description of what the implications of this
> > feature are.
> >
> > Namely, that important _CORE_ networking features are completely
> > bypassed and unusable when SMC applies to a connection.
> >
> > Specifically, all packet shaping, filtering, traffic inspection, and
> > flow management facilities in the kernel will not be able to see nor
> > act upon the data flow of these TCP connections once established.
> >
> > It is always important, and in my opinion required, to list the
> > negative aspects of your change and not just the "wow, amazing"
> > positive aspects.
> >
> > Thanks.
>
> Correct, the SMC-R data stream bypasses TCP and thus cannot enjoy its
> features. This is the price for leveraging the TCP application ecosystem
> while reducing CPU load.
>
> When a load balancer allows the TCP handshake to take place between a
> worker node and the TCP client, RDMA will be used between these two
> nodes. So anything based on TCP connection establishment (including a
> firewall) can apply to SMC-R, too. To be clear -- yes, the data flow
> later on is no longer subject to these features. At least the VLAN
> isolation of the TCP part can be leveraged for the RDMA traffic. From
> our experience, discussions, etc., that tradeoff seems acceptable in a
> classical data center environment.
>
> Improving our cover letter would result in the following new
> introductory motivation part at the beginning and a slightly modified
> list of planned enhancements at the end:
>
> On Fri, 2016-06-03 at 17:26 +0200, Ursula Braun wrote:
>
> > These patches are the initial part of the implementation of the
> > "Shared Memory Communications-RDMA" (SMC-R) protocol. The protocol is
> > defined in RFC7609 [1]. It allows transforming TCP connections to use
> > the "Remote Direct Memory Access over Converged Ethernet" (RoCE)
> > feature of specific communication hardware for data center
> > environments.
> >
> > SMC-R inherits TCP qualities such as reliable connections, host-based
> > firewall packet filtering (on connection establishment) and unmodified
> > application of communication encryption such as TLS (transport layer
> > security) or SSL (secure sockets layer). It is transparent to most
> > existing TCP connection load balancers that are commonly used in the
> > enterprise data center environment for multi-tier application
> > workloads.
> >
> > Being designed for the switched fabric environment of data center
> > networks, it does not need congestion control and thus reaches line
> > speed right away, without having to go through slow start as with TCP.
> > This can be beneficial for short-lived flows, including
> > request-response patterns requiring reliability. A full SMC-R
> > implementation also provides the seamless high availability and load
> > balancing demanded by enterprise installations.
> >
> > SMC-R does not require an RDMA communication manager (RDMA CM). Its
> > use of RDMA provides CPU savings transparently for unmodified
> > applications. For instance, when running 10 parallel connections with
> > uperf, we measured a decrease of 60% in CPU consumption with SMC-R
> > compared to TCP/IP (with throughput and latency comparable; measured
> > on x86_64 with the same RoCE card and port).
>
> These patches are the initial part of the implementation of the
> "Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
> RFC7609 [1]. While SMC-R does not aim to replace TCP, it taps the
> wealth of existing data center TCP socket applications and makes them
> more efficient without the need to rewrite them.
> SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
> For instance, when running 10 parallel connections with uperf, we
> measured a decrease of 60% in CPU consumption with SMC-R compared to
> TCP/IP (with throughput and latency comparable; measured on x86_64
> with the same RoCE card and port).
>
> SMC-R does not require an RDMA communication manager (RDMA CM).
>
> SMC-R inherits TCP qualities such as reliable connections, host-based
> firewall packet filtering (on connection establishment) and unmodified
> application of communication encryption such as TLS (transport layer
> security) or SSL (secure sockets layer). Since original TCP is used to
> establish SMC-R connections, load balancers and packet inspection based
> on TCP/IP connection establishment continue to work for SMC-R.
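^^^ The proposed text above, and the implications list below, mention
switching unmodified TCP applications over to SMC-R either via a dynamic
preload shared library or via a one-line source change. As a rough
sketch of the preload approach -- this is not code from the patch
series, and AF_SMC is assumed here to be 43 purely for illustration --
such a shim could look like this:

/* smc_preload.c - hypothetical LD_PRELOAD shim, sketch only.
 * Build: gcc -shared -fPIC -o libsmc_preload.so smc_preload.c -ldl
 * Run:   LD_PRELOAD=./libsmc_preload.so ./unmodified_tcp_app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef AF_SMC
#define AF_SMC 43	/* assumed; use the value defined by the patches */
#endif

int socket(int domain, int type, int protocol)
{
	static int (*real_socket)(int, int, int);

	if (!real_socket)
		real_socket = (int (*)(int, int, int))
			dlsym(RTLD_NEXT, "socket");

	/* Redirect TCP stream sockets (protocol 0 defaults to TCP) to
	 * the SMC socket family; the SMC socket falls back to plain TCP
	 * if the peer cannot do SMC-R. */
	if (domain == AF_INET && type == SOCK_STREAM &&
	    (protocol == IPPROTO_TCP || protocol == 0))
		return real_socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP);

	return real_socket(domain, type, protocol);
}

The alternative source modification is the corresponding one-line change
from socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) to
socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP).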
>
> On the other hand, using SMC-R implies:
> - either involving a preload library when invoking the unchanged
>   TCP application, or slightly modifying the source by simply changing
>   the socket family in the socket() call
> - accepting extra overhead and latency in connection establishment due
>   to the SMC Connection Layer Control (CLC) handshake
> - explicit coupling of RoCE ports with Ethernet ports
> - not routable, as currently built on RoCE V1
> - bypassing of packet-based networking features
>     - filtering (netfilter)
>     - sniffing (libpcap, packet sockets, (E)BPF)
>     - traffic control (scheduling, shaping)
> - bypassing of IP-header based socket options
> - bypassing of memory buffer (pressure) management
> - unusable together with IPsec
>
> > Overview of the SMC-R Protocol described in informational RFC 7609
> >
> > SMC-R is an open protocol that provides RDMA capabilities over RoCE
> > transparently for applications exploiting TCP sockets.
> > A new socket protocol family PF_SMC is introduced.
> > There are no changes required to applications using the sockets API
> > for TCP stream sockets other than the specification of the new socket
> > family AF_SMC. Unmodified applications can be used by means of a
> > dynamic preload shared library which rewrites the socket API call
> > socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
> > socket(AF_SMC, SOCK_STREAM, IPPROTO_TCP).
> > SMC-R re-uses the address family AF_INET for all addressing purposes
> > around struct sockaddr.
> >
> >
> > SMC-R system architecture layers:
> >
> > +=============================================================================+
> > |                          | unmodified TCP application                      |
> > | native SMC application   +--------------------------------------------------+
> > |                          | dynamic preload shared library                   |
> > +=============================================================================+
> > |                                 SMC socket                                 |
> > +-----------------------------------------------------------------------------+
> > |          | TCP socket (for connection establishment and fallback)          |
> > | IB verbs +------------------------------------------------------------------+
> > |          | IP                                                               |
> > +--------------------+--------------------------------------------------------+
> > | RoCE device driver |                some network device driver              |
> > +=============================================================================+
> >
> >
> > Terms:
> >
> > A link group is determined by an ordered peer pair of TCP client and
> > TCP server (IP addresses and subnet). Reversed client/server roles
> > result in a separate link group.
> > A link is a logical point-to-point connection based on an InfiniBand
> > reliable connected queue pair (RC-QP) between two RoCE ports (MACs
> > and GIDs) of a peer pair.
> > A link group can have 1..8 links for failover and load balancing.
> > This initial Linux implementation always has 1 link per link group.
> > Each link group on a peer can have 1..255 remote memory buffers (RMBs).
> > If more RMBs are needed, a peer can open another link group
> > (this initial Linux implementation) or fall back to TCP.
> > Each RMB has its own particular size and its own (R)DMA mapping and
> > credentials (rtoken consisting of rkey and RDMA "virtual address").
> > This initial Linux implementation uses physically contiguous memory
> > for RMBs, but we are working towards scattered memory because of
> > memory fragmentation.
> > Each RMB has 1..255 RMB elements (RMBEs) of equal size to provide
> > multiplexing of connections within an RMB.
> > An RMBE is the RDMA Write destination organized as a wrapping ring
> > buffer for the data transmitted on a particular connection in one
> > direction (duplex by means of mirror symmetry as with TCP).
> > This initial Linux implementation always has 1 RMBE per RMB and thus
> > an individual RMB for each connection.
> >
> >
> > SMC-R connection establishment with subsequent data transfer:
> >
> >    CLIENT                                                   SERVER
> >
> > TCP three-way handshake:
> >                          regular TCP SYN
> >    -------------------------------------------------------->
> >                          regular TCP SYN ACK
> >    <--------------------------------------------------------
> >                          regular TCP ACK
> >    -------------------------------------------------------->
> >
> > SMC Connection Layer Control (CLC) handshake
> > exchanges RDMA credentials between peers:
> >             via above TCP connection: SMC CLC Proposal
> >    -------------------------------------------------------->
> >             via above TCP connection: SMC CLC Accept
> >    <--------------------------------------------------------
> >             via above TCP connection: SMC CLC Confirm
> >    -------------------------------------------------------->
> >
> > SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
> >             RoCE RC-QP: SMC LLC Confirm Link
> >    <========================================================
> >             RoCE RC-QP: SMC LLC Confirm Link response
> >    ========================================================>
> >
> > SMC data transmission (incl. SMC Connection Data Control (CDC) message):
> >             RoCE RC-QP: RDMA Write
> >    ========================================================>
> >             RoCE RC-QP: SMC CDC message (flow control)
> >    ========================================================>
> >    ...
> >
> >             RoCE RC-QP: RDMA Write
> >    <========================================================
> >             RoCE RC-QP: SMC CDC message (flow control)
> >    <========================================================
> >    ...
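^^^ The data transmission step above -- an RDMA Write followed by an SMC
CDC message on the same RC queue pair -- relies on RC QPs executing work
requests in order, so the peer sees the CDC message only after the
written data has been placed in its RMBE. A purely illustrative
user-space libibverbs sketch of chaining the two work requests (all
names hypothetical; the kernel implementation uses the in-kernel verbs
API instead):

#include <infiniband/verbs.h>
#include <stdint.h>

/* Post an RDMA Write of `len` payload bytes into the peer's RMBE,
 * chained with a small SEND carrying the CDC (flow control) message. */
static int post_write_then_cdc(struct ibv_qp *qp,
			       struct ibv_mr *data_mr, void *data,
			       uint32_t len,
			       uint64_t rtoken_vaddr, uint32_t rtoken_rkey,
			       struct ibv_mr *cdc_mr, void *cdc_msg,
			       uint32_t cdc_len)
{
	struct ibv_sge data_sge = {
		.addr   = (uintptr_t)data,
		.length = len,
		.lkey   = data_mr->lkey,
	};
	struct ibv_sge cdc_sge = {
		.addr   = (uintptr_t)cdc_msg,
		.length = cdc_len,
		.lkey   = cdc_mr->lkey,
	};
	struct ibv_send_wr cdc_wr = {
		.wr_id      = 2,
		.sg_list    = &cdc_sge,
		.num_sge    = 1,
		.opcode     = IBV_WR_SEND,
		.send_flags = IBV_SEND_SIGNALED,
	};
	struct ibv_send_wr write_wr = {
		.wr_id   = 1,
		.next    = &cdc_wr,	/* CDC message follows the write */
		.sg_list = &data_sge,
		.num_sge = 1,
		.opcode  = IBV_WR_RDMA_WRITE,
		.wr = { .rdma = {
			/* rtoken "virtual address" plus write offset */
			.remote_addr = rtoken_vaddr,
			.rkey        = rtoken_rkey,
		} },
	};
	struct ibv_send_wr *bad_wr;

	return ibv_post_send(qp, &write_wr, &bad_wr);
}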
> >
> >
> > Data flow within an established connection:
> >
> > +----------------------------------------------------------------------------
> > | SENDER
> > |   sendmsg()
> > |      |
> > |      | produces into sndbuf [sender's process context]
> > |      v
> > |   +--------+
> > |   | sndbuf |  [ring buffer]
> > |   +--------+
> > |      |
> > |      | consumes from sndbuf and produces into receiver's RMBE [any context]
> > |      | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
> > |      |
> > +------|-----------------------------------------------------------------------
> >        |
> > +------|-----------------------------------------------------------------------
> > |      v                                                              RECEIVER
> > |   +------+
> > |   | RMBE |  [ring buffer, can have size different from sender's sndbuf]
> > |   |      |  [RMBE represents rcvbuf, no further de-coupling as on sender side]
> > |   +------+
> > |      |
> > |      | consumes from RMBE [receiver's process context]
> > |      v
> > |   recvmsg()
> > +-----------------------------------------------------------------------------
> >
> >
> > Flow control ("cursor" updates) by means of SMC CDC messages:
> >
> >         SENDER                                 RECEIVER
> >
> >         sends updates via CDC-------------+    sends updates via CDC
> >         on consuming from sndbuf          |    on consuming from RMBE
> >         and producing into RMBE           |    by means of recvmsg()
> >                                           |                  |
> >                                           |                  |
> >        +----------------------------------|------------------+
> >        |                                  |
> >     +--v-------------------------+     +--v-----------------------+
> >     | receiver's consumer cursor |     | sender's producer cursor----+
> >     +----------------|-----------+     +--------------------------+  |
> >                      |                                               |
> >                      |           receiver's RMBE                     |
> >                      |       +--------------------------+           |
> >                      |       |                          |           |
> >                      +--------------------------------+ |           |
> >                              |                        | |           |
> >                              |                        v |           |
> >                              |            +------------|           |
> >                              |------------+////////////|           |
> >                              |//RDMA data written by////|           |
> >                              |////sender that is////////|           |
> >                              |/available to be consumed/|           |
> >                              |///////// +---------------|           |
> >                              |----------+^              |           |
> >                              |           |              |           |
> >                              |           +---------------------------+
> >                              |                         |
> >                              +-------------------------+
> >
> > Sending updates of the producer cursor is immediate for low latency;
> > something like Nagle's algorithm (absence of TCP_NODELAY) is optional
> > and currently not part of this initial Linux implementation.
> > Sending updates of the consumer cursor is conditional to avoid the
> > silly window syndrome.
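^^^ The cursors above are positions within the wrapping RMBE ring
buffer. As a generic sketch of the cursor arithmetic -- not SMC's exact
wire encoding, whose CDC cursors also carry wrap counters -- free space
for the sender and available data for the receiver can be derived like
this:

#include <stdint.h>

/* Generic wrapping ring-buffer cursor arithmetic, sketch only.
 * prod and cons are byte offsets into a buffer of `size` bytes.
 * One byte stays unused so prod == cons unambiguously means "empty". */

/* Bytes the sender may still RDMA-write without overwriting unread data. */
static inline uint32_t rmbe_free_space(uint32_t prod, uint32_t cons,
				       uint32_t size)
{
	if (cons > prod)
		return cons - prod - 1;
	return size - (prod - cons) - 1;
}

/* Bytes available for recvmsg() to consume. */
static inline uint32_t rmbe_data_available(uint32_t prod, uint32_t cons,
					   uint32_t size)
{
	if (prod >= cons)
		return prod - cons;
	return size - (cons - prod);
}

One way to avoid the silly window syndrome mentioned above is to
suppress consumer cursor updates until rmbe_free_space() has grown by a
significant fraction of the buffer, rather than after every small read.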
> >
> >
> > Normal connection termination:
> >
> > Normal connection termination starts transitioning from socket state
> > ACTIVE via either "Active Close" or "Passive Close".
> >
> > shutdown rdwr            +-----------------+
> > or close,   +----------->|  INIT / CLOSED  |<---------------+
> > send PeerCon|nClosed     +-----------------+                | PeerConnClosed
> >             |                     |                         |   received
> >             |   connection        | established             |
> >             |                     V                         |
> >     +----------------+   +-----------------+        +----------------+
> >     |AppFinCloseWait |   |     ACTIVE      |        |PeerFinCloseWait|
> >     +----------------+   +-----------------+        +----------------+
> >             |               |            |                  |
> >             | Active Close: |            |Passive Close:    |
> >             |   close or    |            |PeerConnClosed or |
> >             | shutdown wr or|            |PeerDoneWriting   |
> >             | shutdown rdwr |            |received          |
> >             |               V            V                  |
> >  PeerConnClo|sed    +--------------+  +-------------+       | close or
> >  received   +--<----|PeerCloseWait1|  |AppCloseWait1|--->---+ shutdown rdwr,
> >             |       +--------------+  +-------------+       | send
> >             |    PeerDoneWri|ting      | shutdown wr,       | PeerConnClosed
> >             |    received   |    send Pee|rDoneWriting      |
> >             |               V            V                  |
> >             |       +--------------+  +-------------+       |
> >             +--<----|PeerCloseWait2|  |AppCloseWait2|--->---+
> >                     +--------------+  +-------------+
> >
> > In state CLOSED, the socket can be destroyed only once the application
> > has issued a close().
> >
> > Abnormal connection termination:
> >
> >                          +-----------------+
> >             +----------->|  INIT / CLOSED  |<---------------+
> >             |            +-----------------+                |
> >             |                                               |
> >             |    +-----------------------+                  |
> >             |    |       Any state       |                  |
> >  PeerConnAbo|rt  |    (before setting    |                  | send
> >  received   |    |    PeerConnClosed     |                  | PeerConnAbort
> >             |    |     indicator in      |                  |
> >             |    |     peer's RMBE)      |                  |
> >             |    +-----------------------+                  |
> >             |               |           |                   |
> >             | Active Abort: |           | Passive Abort:    |
> >             | problem,      |           | PeerConnAbort     |
> >             | send          |           | received,         |
> >             | PeerConnAbort,|           | ECONNRESET        |
> >             | ECONNABORTED  |           |                   |
> >             |               V           V                   |
> >             |       +--------------+   +--------------+     |
> >             +-------|PeerAbortWait |   | ProcessAbort |-----+
> >                     +--------------+   +--------------+
> >
> >
> > Implementation notes beyond RFC 7609:
> >
> > A PNET table in sysfs provides the mapping between network device
> > names and RoCE InfiniBand device names for the transparent switch of
> > data communication.
> > A PNET table can contain an arbitrary number of PNETIDs.
> > Each PNETID contains exactly one (Ethernet) network device name
> > and one or more RoCE InfiniBand device names.
> > Each device name can exist in at most one PNETID (no overlapping).
> > This initial Linux implementation allows at most one RoCE InfiniBand
> > device name per PNETID.
> > After a new TCP connection is established, the network device name
> > used for egress traffic with the TCP connection's local source IP
> > address is used as the key to look up the unique PNETID, and the RoCE
> > InfiniBand device of this PNETID is used to switch data communication
> > from TCP to RDMA during the SMC CLC handshake.
> >
> >
> > Problem determination:
> >
> > A protocol dissector is available with upstream Wireshark for
> > formatting SMC-R related RoCE LAN traffic.
> > [https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]
> >
> >
> > We are working on enhancing the Linux implementation to cover:
> >
> > - Improve default socket closing asynchronicity
> > - Address corner cases with many parallel connections
> > - Load balancing and fail-over
> > - Urgent data
> > - Splice and sendpage support
> > - Keepalive
> > - More socket options
> > - IPv6 support
> > - Tracing
> > - Statistics support
>
> - Improve default socket closing asynchronicity
> - Address corner cases with many parallel connections
> - Tracing
> - Integrated load balancing and fail-over within a link group
> - Splice and sendpage support
> - IPv6 addressing support
> - Keepalive, Cork
> - Namespaces support
> - Urgent data
> - More socket options
> - Diagnostics
> - Statistics support
> - SNMP support
>
> > References:
> >
> > [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
>
> Do you agree with this changed cover letter?
>
> Kind regards,
> Ursula Braun

--
To unsubscribe from this list: send the line "unsubscribe linux-s390" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html