On 24.05.22 09:49, Tony Lu wrote:
> On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
>> From: "D. Wythe" <alibuda@xxxxxxxxxxxxxxxxx>
>>
>> Hi Karsten,
>>
>> We are promoting SMC-R in the field of cloud computing. Due to the
>> particularity of business on the cloud, the scale and the types of
>> customer applications are unpredictable. As participants in SMC-R,
>> we also hope that SMC-R can cover more application scenarios, and
>> many connection problems have been exposed during this time. There
>> are two main issues: one is that establishing a single connection
>> takes longer than with TCP, the other is that the degree of
>> concurrency is low under multi-connection processing. This patch
>> set mainly optimizes the first issue; follow-up work on the second
>> issue will be shared in the future.
>>
>> In terms of the communication process, under the current
>> implementation a TCP three-way handshake needs only 1 RTT, while
>> SMC-R currently requires 4 RTTs: 2 RTTs over IP (TCP handshake,
>> SMC proposal & accept) and 2 RTTs over IB (two RKEY exchanges).
>> This is the most influential factor affecting connection
>> establishment time at the moment.
>>
>> We have noticed that a single network interface card is mainstream
>> on the cloud, due to the advantages in cloud deployment cost and
>> the cloud's own disaster recovery support. On the other hand, the
>> emergence of RoCE LAG technology means that we no longer need to
>> handle multiple RDMA network interface cards ourselves, just as
>> NIC bonding does. In Alibaba, RoCE LAG is widely used for RDMA.
>
> I think this is an interesting topic, whether we need SMC-level
> link redundancy. I agree that RoCE LAG and the RDMA offerings of
> cloud vendors handle redundancy and failover in the lower layer,
> transparently for SMC.
>
> So let's move on: if an RDMA device has redundancy capability, we
> could make SMC simpler by offering an option to user space, or by
> deciding based on the device capability (if we have such a flag).
> This lets the lower layer ensure the reliability of the link group.
>
> As RFC 7609 mentions, we have to do some extra work for reliability
> when adding a link. That work could become optional if the device
> is capable of redundancy, making the link group simpler and faster
> (the so-called SMC-2RTT in this RFC).
>
> I also notice that RFC 7609 was released in August 2015, which is
> earlier than RoCE LAG. RoCE LAG was introduced with
> ConnectX-3/ConnectX-3 Pro in kernel 4.0 and became available in
> 2017. The same goes for cloud vendors' RDMA adapters, such as the
> Alibaba Elastic RDMA adapter in [1].
>
> Given that, I propose to make the second link optional in a newly
> created link group. Also, if possible, RFC 7609 could be updated or
> extended for this nowadays common case.
>
> Looking forward to your message, Karsten, D. Wythe and folks.
>
> [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@xxxxxxxxxxxxxxxxx/
>
> Thanks,
> Tony Lu
>

Thank you D. Wythe for your proposals, the prototype and the
measurements. They sound quite promising to us. We need to carefully
evaluate them and make sure everything is compatible with the
existing implementations of SMC-D and SMC-R v1 and v2. In the typical
s390 environment RoCE LAG is probably not good enough, as the card is
still a single point of failure, so your ideas need to be compatible
with link redundancy. We also need to make sure that this extension
of the protocol does not block other desirable extensions.

Your prototype is very helpful for understanding. Before submitting
any code patches to net-next, we should agree on the details of the
protocol extension. Maybe you could formulate your proposal in plain
text, so we can discuss it here? We also need to inform you that
several public holidays are coming up in the next weeks and several
of our team will be out for summer vacation, so please allow for
longer response times.

Kind regards
Alexandra Winter

>> In that case, SMC-R has only one single link. The RKEY LLC
>> messages that exchange information across all links are then no
>> longer needed; the SMC proposal & accept already completes the
>> exchange of all required information. So we think we can remove
>> the RKEY exchange in that case, which saves us 2 RTTs over IB. We
>> call this SMC-R 2-RTT.
>>
>> On the other hand, we can use TCP Fast Open and carry the SMC
>> proposal data in the TCP SYN message, reducing the time SMC waits
>> for the TCP connection to be established. This saves us another
>> 1 RTT over IP.
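For readers less familiar with TCP Fast Open: in userspace the
mechanics look roughly like the sketch below (illustrative only; the
SMC proposal would of course be built inside the kernel, so the
payload and address here are hypothetical stand-ins). Whether a TFO
cookie is already cached decides whether the data really rides in
the SYN.

/* Hypothetical TFO client sketch: the payload is handed to sendto()
 * with MSG_FASTOPEN, which performs the implicit connect() and,
 * given a cached TFO cookie, carries the data in the SYN. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000	/* for older libc headers */
#endif

int main(void)
{
	const char proposal[] = "hypothetical SMC proposal payload";
	struct sockaddr_in srv = {
		.sin_family = AF_INET,
		.sin_port = htons(80),
	};
	inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr); /* example IP */

	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}
	/* No explicit connect(): MSG_FASTOPEN implies it. */
	if (sendto(fd, proposal, sizeof(proposal), MSG_FASTOPEN,
		   (struct sockaddr *)&srv, sizeof(srv)) < 0)
		perror("sendto(MSG_FASTOPEN)");
	close(fd);
	return 0;
}

On the server side, besides the net.ipv4.tcp_fastopen sysctl used in
Server.sh below, an application can enable TFO per socket with
setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen))
before listen().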
>> Based on the above two points, in this scenario we can compress
>> the communication process of SMC-R to 1 RTT over IP, so that we
>> can theoretically get close to the connection establishment time
>> of TCP. We call this SMC-R 1-RTT. Of course, the specific results
>> are also affected by the implementation.
>>
>> In our test environment, we host two VMs on the same host for
>> wrk/nginx tests, using scripts similar to the following:
>>
>> Client.sh
>>
>> conn=$1
>> thread=$2
>> host=$3
>>
>> wrk -H 'Connection: Close' -c ${conn} -t ${thread} -d 10 http://${host}/
>>
>> Server.sh
>>
>> sysctl -w net.ipv4.tcp_fastopen=3
>> smc_run nginx
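If I remember correctly, smc_run (from smc-tools) works by
preloading a library that turns the application's AF_INET stream
sockets into AF_SMC sockets, so nginx itself stays unmodified. A
natively written SMC application would create its socket roughly as
below (a sketch; the SMCPROTO_* values match the kernel sources, but
as far as I know there is no official userspace header for them):

/* Minimal AF_SMC socket creation; everything after socket() uses
 * ordinary sockaddr_in handling, as with TCP. */
#include <stdio.h>
#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43		/* from include/linux/socket.h */
#endif
#define SMCPROTO_SMC	0	/* SMC over IPv4 */
#define SMCPROTO_SMC6	1	/* SMC over IPv6 */

int main(void)
{
	int fd = socket(AF_SMC, SOCK_STREAM, SMCPROTO_SMC);
	if (fd < 0) {
		perror("socket(AF_SMC)"); /* e.g. SMC not built in */
		return 1;
	}
	/* bind()/listen()/connect() continue as for a TCP socket. */
	return 0;
}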
>> The statistics show that:
>>
>> +----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
>> |type\args | -c1 -t1    | -c2 -t1     | -c5 -t1     | -c10 -t1    | -c200 -t1   | -c200 -t4   | -c2000 -t8   |
>> +----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
>> |net-next  | 4188.5qps  | 5942.04qps  | 7621.81qps  | 7678.62qps  | 8204.94qps  | 8457.57qps  | 5687.60qps   |
>> +----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
>> |SMC-2RTT  | 4730.17qps | 7394.85qps  | 11532.78qps | 12016.22qps | 11520.81qps | 11391.36qps | 10364.41qps  |
>> +----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
>> |SMC-1RTT  | 5702.77qps | 9645.18qps  | 11899.20qps | 12005.16qps | 11536.67qps | 11420.87qps | 10392.4qps   |
>> +----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
>> |TCP       | 6415.74qps | 11034.10qps | 16716.21qps | 22217.06qps | 35926.74qps | 117460qps   | 120291.16qps |
>> +----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
>>
>> It can clearly be seen that:
>>
>> 1. In serial short-lived connection scenarios (-c1 -t1), SMC-R
>> after optimization reaches 88% of TCP. There are still many
>> implementation details that can be optimized; we hope to bring the
>> performance of SMC in this scenario to 90% of TCP.
>>
>> 2. The problem is very serious in multi-threaded, multi-connection
>> scenarios; in the worst case SMC reaches only 10% of TCP. Even
>> though SMC-1RTT brings some improvement for this scenario, it is
>> clear that the bottleneck lies elsewhere. We are doing some
>> prototyping to solve this and hope to reach 60% of TCP in
>> multi-threaded, multi-connection scenarios; SMC-1RTT is an
>> important prerequisite for the upper limit of that subsequent
>> optimization.
>>
>> In this patch set we have only completed a simple prototype, just
>> enough to make sure SMC-1RTT works.
>>
>> We are sincerely looking forward to your comments; please let us
>> know if you have any suggestions.
>>
>> Thanks.
>>
>> Signed-off-by: D. Wythe <alibuda@xxxxxxxxxxxxxxxxx>
>> ---
--------8< snip >8--------