Hi, all
# Background
For the background and the previous discussion, see [1].
We found that SMC-D can be used to accelerate communication inside a single
OS instance, such as loopback traffic or traffic between two containers in
the same OS instance. This patch set therefore provides an SMC-D dummy device
(we call it the SMC-D loopback device) that emulates an ISM device, so that
SMC-D can also be used on architectures other than s390. The SMC-D loopback
device is designed as a system-global device, visible to all containers.
This version is implemented on top of the generalized interface provided
by [2].
One issue in this version is still open; it is described in # Open issue.
# Design
This patch set basically follows the design of the previous version.
Patches #1/9 ~ #3/9 attempt to decouple ISM-related structures from the
generic SMC-D code and extract some helpers, to make the SMC-D protocol
compatible with devices other than the s390 ISM device.
Patch #4/9 introduces a loopback device, which is defined as an SMC-D v2
device and designed to provide communication between SMC sockets in the same
OS instance.
+-------------------------------------------+
|  +--------------+       +--------------+  |
|  | SMC socket A |       | SMC socket B |  |
|  +--------------+       +--------------+  |
|       ^                         ^         |
|       |    +----------------+   |         |
|       |    |   SMC stack    |   |         |
|       +--->| +------------+ |<--|         |
|            | |   dummy    | |             |
|            | |   device   | |             |
|            +-+------------+-+             |
|                   OS                      |
+-------------------------------------------+
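The idea of putting a loopback device behind the same generic code path can
be sketched in user-space C. This is only an illustration under stated
assumptions: `struct demo_dev`, `struct demo_dev_ops`, `demo_lo_move_data`
and `demo_generic_send` are hypothetical names invented here, not the
kernel's structures or the code in this patch set.

```c
#include <stddef.h>
#include <string.h>

struct demo_dev;

/* Hypothetical, minimal device-operations table (not the kernel's smcd_ops). */
struct demo_dev_ops {
	int (*move_data)(struct demo_dev *dev, size_t offset,
			 const void *data, size_t len);
};

/* Hypothetical device: an ops pointer plus a small buffer standing in for an RMB. */
struct demo_dev {
	const struct demo_dev_ops *ops;
	unsigned char rmb[64];
};

/* Loopback flavour: "moving" data is a bounds-checked copy in local memory. */
int demo_lo_move_data(struct demo_dev *dev, size_t offset,
		      const void *data, size_t len)
{
	if (offset > sizeof(dev->rmb) || len > sizeof(dev->rmb) - offset)
		return -1;
	memcpy(dev->rmb + offset, data, len);
	return 0;
}

const struct demo_dev_ops demo_lo_ops = {
	.move_data = demo_lo_move_data,
};

/* Generic code only goes through the ops table, so it does not care which
 * kind of device sits underneath. */
int demo_generic_send(struct demo_dev *dev, size_t offset,
		      const void *data, size_t len)
{
	return dev->ops->move_data(dev, offset, data, len);
}
```

A caller would initialise a `struct demo_dev` with `.ops = &demo_lo_ops` and
then only ever call `demo_generic_send()`.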
Patches #5/9 ~ #8/9 extend the SMC-D protocol interface (smcd_ops) for
scenarios where SMC-D is used to communicate within a VM (loopback here) or
between VMs on the same host (based on the virtio-ism device, see [3]). What
these scenarios have in common is that the local sndbuf and the peer RMB can
be mapped to the same physical memory region, so the data copy between the
local sndbuf and the peer RMB can be omitted. The performance improvement
brought by this extension is shown in # Benchmark Test.
      +----------+                     +----------+
      | socket A |                     | socket B |
      +----------+                     +----------+
            |                               ^
            |         +---------+           |
       regard as      |         | ----------|
       local sndbuf   |  B's    |     regard as
            |         |  RMB    |     local RMB
            |-------> |         |
                      +---------+
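The difference between the copy path and the no-copy path in the picture can
be illustrated with a small user-space sketch. It is not the kernel
implementation: `struct demo_buf`, `demo_send_copy`, `demo_attach` and
`demo_send_nocopy` are hypothetical names, and plain pointers stand in for
the memory mapping that the real extension would set up.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a send buffer or an RMB; not a kernel structure. */
struct demo_buf {
	unsigned char *mem;	/* backing memory */
	size_t len;		/* capacity in bytes */
};

/* Copy path: data is staged in the local sndbuf, then copied to the peer RMB. */
size_t demo_send_copy(struct demo_buf *sndbuf, struct demo_buf *peer_rmb,
		      const void *data, size_t n)
{
	if (n > sndbuf->len || n > peer_rmb->len)
		return 0;
	memcpy(sndbuf->mem, data, n);		/* user data -> local sndbuf */
	memcpy(peer_rmb->mem, sndbuf->mem, n);	/* local sndbuf -> peer RMB  */
	return n;
}

/* No-copy path, step 1: make the local sndbuf alias the peer's RMB memory. */
void demo_attach(struct demo_buf *sndbuf, const struct demo_buf *peer_rmb)
{
	sndbuf->mem = peer_rmb->mem;
	sndbuf->len = peer_rmb->len;
}

/* No-copy path, step 2: writing into the sndbuf lands directly in the peer
 * RMB, so the second copy above disappears. */
size_t demo_send_nocopy(struct demo_buf *sndbuf, const void *data, size_t n)
{
	if (n > sndbuf->len)
		return 0;
	memcpy(sndbuf->mem, data, n);
	return n;
}
```

After `demo_attach()`, socket A treats B's RMB as its local sndbuf while
socket B keeps treating the same memory as its local RMB, which is the
situation the diagram shows.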
Patch #9/9 implements support for the extended SMC-D protocol interface
described above in the loopback device.
# Benchmark Test
* Test environments:
- VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
- SMC sndbuf/RMB size 1MB.
* Test objects:
- TCP lo: run on TCP loopback.
- domain: run on UNIX domain sockets.
- SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
- SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.
1. ipc-benchmark (see [4])
- ./<foo> -c 1000000 -s 100
                       TCP-lo   domain            SMC-lo            SMC-lo-nocpy
Message rate (msg/s)   79025    115736(+46.45%)   146760(+85.71%)   149800(+89.56%)
2. sockperf
- serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
- clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp
--msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
                  TCP-lo     SMC-lo             SMC-lo-nocpy
Bandwidth(MBps)   4822.388   4940.918(+2.56%)   8086.67(+67.69%)
Latency(us)       6.298      3.352(-46.78%)     3.35(-46.81%)
3. iperf3
- serv: <smc_run> taskset -c <cpu> iperf3 -s
- clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
                TCP-lo   SMC-lo         SMC-lo-nocpy
Bitrate(Gb/s)   40.7     40.5(-0.49%)   72.4(+77.89%)
4. nginx/wrk
- serv: <smc_run> nginx
- clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
             TCP-lo      SMC-lo               SMC-lo-nocpy
Requests/s   155994.57   214544.79(+37.53%)   215538.55(+38.17%)
# Open issue
The issue that remains unresolved is how to detect that the source and
target of a CLC proposal are within the same OS instance and can communicate
through the SMC-D loopback device. A similar issue exists when using
virtio-ism devices (for the background and details of the virtio-ism device,
see [3]).
In previous discussions, multiple options were proposed (see [5]). Thanks
again for the help of the community. cc Alexandra Winter :)
But as we discussed, these solutions have some imperfections. So this version
of the RFC continues to use the previous workaround, that is, a 64-bit random
GID is generated for the SMC-D loopback device. If the GIDs of the devices
found by the two peers are the same, the peers are considered to be in the
same OS instance and can communicate with each other through the loopback
device.
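The workaround can be sketched as follows. This is a user-space illustration
only: `struct demo_lo_dev`, `demo_random_gid` and `demo_same_os_instance` are
hypothetical names, and `rand()` is used purely to keep the example
self-contained; it says nothing about the random source the patch set
actually uses.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-OS-instance loopback device carrying only its GID. */
struct demo_lo_dev {
	uint64_t gid;
};

/* Assemble a 64-bit value from rand(), 15 bits at a time (RAND_MAX is only
 * guaranteed to be at least 32767). Illustration only, not a good RNG. */
uint64_t demo_random_gid(void)
{
	uint64_t gid = 0;
	int i;

	for (i = 0; i < 5; i++)
		gid = (gid << 15) | ((uint64_t)rand() & 0x7fff);
	return gid;
}

/* The check described above: equal GIDs are taken to mean "same OS instance". */
bool demo_same_os_instance(uint64_t local_gid, uint64_t peer_gid)
{
	return local_gid == peer_gid;
}
```

The check is cheap, but it can only give a wrong "same OS instance" answer
when two independently generated GIDs happen to be equal.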
This approach carries a very small risk. Consider the following situations:
(1) Assume that the SMC-D loopback devices of two different OS instances
    happen to generate the same 64-bit GID.
    For convenience, we refer to the sockets on these two OS instances as
    server A and client B.
    A wrongly concludes that the two are on the same OS instance, because
    the CLC proposal message carries the same GID. A then creates its RMB
    and sends the 64-bit token-A to B in the CLC accept message.
    B receives the CLC accept message and, according to patch #7/9, tries
    to attach its sndbuf to A's RMB using token-A.
(2) Assume that the OS instance where B is located happens to have an
    unattached RMB whose 64-bit token is the same as token-A.
    B then successfully attaches its sndbuf to the wrong RMB, creates its
    own RMB, and sends token-B to A in the CLC confirm message.
    Similarly, A receives the message and tries to attach its sndbuf to
    B's RMB using token-B.
(3) Similar to (2), assume that the OS instance where A is located happens
    to have an unattached RMB whose 64-bit token is the same as token-B.
    A then successfully attaches its sndbuf to the wrong RMB. Both sides
    mistakenly believe that an SMC-D connection based on the loopback
    device has been established between them.
If all three of the above coincidences happen, that is, 64-bit random numbers
collide three times, an unreachable SMC-D connection is established, which is
nasty. If any one of them does not happen, the connection safely falls back
to TCP.
Since the chance of all three happening is very small, I wonder whether this
risk, with a probability of 1/2^(64*3), can be tolerated?
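The three conditions and the resulting probability can be written down as a
small user-space model. It is a sketch under stated assumptions: the names
(`struct demo_os`, `demo_has_rmb`, `demo_handshake`,
`demo_triple_collision_probability`) are hypothetical, each OS instance is
modelled as a GID plus a list of unattached RMB tokens, and the probability
follows the estimate above by treating the three events as independent,
uniformly random 64-bit collisions.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_result { DEMO_FALLBACK_TCP, DEMO_SMCD_LOOPBACK };

/* Hypothetical model of one OS instance. */
struct demo_os {
	uint64_t gid;			/* GID of its loopback device */
	const uint64_t *rmb_tokens;	/* tokens of unattached RMBs  */
	int nr_tokens;
};

bool demo_has_rmb(const struct demo_os *os, uint64_t token)
{
	int i;

	for (i = 0; i < os->nr_tokens; i++)
		if (os->rmb_tokens[i] == token)
			return true;
	return false;
}

/* a: OS instance of server A, b: OS instance of client B.
 * For a genuine loopback connection, a and b are the same instance. */
enum demo_result demo_handshake(const struct demo_os *a, const struct demo_os *b,
				uint64_t token_a, uint64_t token_b)
{
	if (a->gid != b->gid)		/* (1) GIDs differ */
		return DEMO_FALLBACK_TCP;
	if (!demo_has_rmb(b, token_a))	/* (2) B cannot attach by token-A */
		return DEMO_FALLBACK_TCP;
	if (!demo_has_rmb(a, token_b))	/* (3) A cannot attach by token-B */
		return DEMO_FALLBACK_TCP;
	return DEMO_SMCD_LOOPBACK;
}

/* (2^-64)^3 = 2^-192, computed by repeated halving to avoid needing libm. */
double demo_triple_collision_probability(void)
{
	double p = 1.0;
	int i;

	for (i = 0; i < 64 * 3; i++)
		p *= 0.5;
	return p;
}
```

Under these assumptions the value is 2^-192, roughly 1.6e-58; a GID collision
alone is not enough, because steps (2) and (3) still fail and the connection
falls back to TCP.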