On Tue, Dec 20, 2022 at 03:02:45PM +0100, Niklas Schnelle wrote:
>On Tue, 2022-12-20 at 11:21 +0800, Wen Gu wrote:
>> Hi, all
>>
>> # Background
>>
>> As previously mentioned in [1], we (Alibaba Cloud) are trying to use SMC
>> to accelerate TCP applications in cloud environments, improving inter-host
>> and inter-VM communication.
>>
>> In addition to this, we also found SMC-D valuable for local inter-process
>> communication, such as accelerating communication between containers
>> within the same host. So this RFC provides an SMC-D loopback solution for
>> that scenario, bringing a significant improvement in latency and
>> throughput compared to TCP loopback.
>>
>> # Design
>>
>> This patch set provides an SMC-D loopback solution.
>>
>> Patch #1/5 and #2/5 provide an SMC-D based dummy device, preparing for
>> the inter-process communication acceleration. Besides loopback
>> acceleration, the dummy device also meets the requirement mentioned in
>> [2]: it gives the broader community a way to test SMC-D logic without an
>> ISM device.
>>
>>  +------------------------------------------+
>>  |  +-----------+           +-----------+   |
>>  |  | process A |           | process B |   |
>>  |  +-----------+           +-----------+   |
>>  |        ^                       ^         |
>>  |        |    +---------------+  |         |
>>  |        |    |   SMC stack   |  |         |
>>  |        +--->| +-----------+ |<-+         |
>>  |             | |   dummy   | |            |
>>  |             | |  device   | |            |
>>  |             +-+-----------+-+            |
>>  |                    VM                    |
>>  +------------------------------------------+
>>
>> Patch #3/5, #4/5 and #5/5 provide a way to avoid the data copy from
>> sndbuf to RMB and improve SMC-D loopback performance. By extending
>> smcd_ops with two new operations, attach_dmb and detach_dmb, the sender's
>> sndbuf shares the same physical memory region with the receiver's RMB.
>> Data copied from userspace into the sender's sndbuf therefore reaches the
>> receiver's RMB directly, without an extra memory copy inside the same
>> kernel.
>>
>>        +----------+          +----------+
>>        | socket A |          | socket B |
>>        +----------+          +----------+
>>              |                     ^
>>              |     +---------+     |
>>    regard as |     |         |-----+
>>  local sndbuf|     |   B's   |  regard as
>>              |     |   RMB   |  local RMB
>>              +---->|         |
>>                    +---------+
>
>Hi Wen Gu,
>
>I maintain the s390 specific PCI support in Linux and would like to
>provide a bit of background on this. You're surely wondering why we even
>have a copy in there for our ISM virtual PCI device. To understand why
>this copy operation exists and why we need to keep it working, one needs
>a bit of s390 aka mainframe background.
>
>On s390 all (currently supported) native machines have a mandatory
>machine level hypervisor. All OSs, whether z/OS or Linux, run either on
>this machine level hypervisor as so-called Logical Partitions (LPARs) or
>as second/third/… level guests on e.g. a KVM or z/VM hypervisor that in
>turn runs in an LPAR. Now, in terms of memory this machine level
>hypervisor, sometimes called PR/SM, is, unlike KVM, z/VM, or VMware, a
>partitioning hypervisor without paging. This is one of the main reasons
>for the very-near-native performance of the machine hypervisor, as the
>memory of its guests acts just like native RAM on other systems. It is
>never paged out and is always accessible to IOMMU-translated DMA from
>devices without the need for pinning pages, and besides a trivial
>offset/limit adjustment an LPAR's MMU does the same amount of work as an
>MMU on a bare metal x86_64/ARM64 box.
>
>It also means, however, that when SMC-D is used to communicate between
>LPARs via an ISM device, there is no way of mapping the DMBs to the same
>physical memory, as there exists no MMU-like layer spanning partitions
>that could do such a mapping. Meanwhile, for machine level firmware,
>including the ISM virtual PCI device, it is still possible to _copy_
>memory between different memory partitions. So while I do see the appeal
>of skipping the memcpy() for loopback, or even between guests of a
>paging hypervisor such as KVM, which can map the DMBs onto the same
>physical memory, we must keep in mind this original use case requiring a
>copy operation.
>
>Thanks,
>Niklas
>
>>
>> # Benchmark Test
>>
>> * Test environments:
>>   - VM with Intel Xeon Platinum, 8 cores @ 2.50GHz, 16 GiB memory.
>>   - SMC sndbuf/RMB size 1MB.
>>
>> * Test objects:
>>   - TCP: runs over TCP loopback.
>>   - domain: runs over UNIX domain sockets.
>>   - SMC lo: runs over the SMC loopback device with patches #1/5 ~ #2/5.
>>   - SMC lo-nocpy: runs over the SMC loopback device with patches #1/5 ~ #5/5.
>>
>> 1. ipc-benchmark (see [3])
>>
>>  - ./<foo> -c 1000000 -s 100
>>
>>                  TCP      domain            SMC-lo             SMC-lo-nocpy
>> Message
>> rate (msg/s)   75140     129548(+72.41%)    152266(+102.64%)   151914(+102.17%)
>
>Interesting that it does beat UNIX domain sockets. Also, see my comment
>below on nginx/wrk, as this seems very similar.
>
>>
>> 2. sockperf
>>
>>  - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>  - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>
>>                     TCP          SMC-lo              SMC-lo-nocpy
>> Bandwidth(MBps)     4943.359     4936.096(-0.15%)    8239.624(+66.68%)
>> Latency(us)         6.372        3.359(-47.28%)      3.25(-49.00%)
>>
>> 3. iperf3
>>
>>  - serv: <smc_run> taskset -c <cpu> iperf3 -s
>>  - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15
>>
>>                     TCP          SMC-lo              SMC-lo-nocpy
>> Bitrate(Gb/s)       40.5         41.4(+2.22%)        76.4(+88.64%)
>>
>> 4. nginx/wrk
>>
>>  - serv: <smc_run> nginx
>>  - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80
>>
>>                     TCP          SMC-lo                SMC-lo-nocpy
>> Requests/s          154643.22    220894.03(+42.84%)    226754.3(+46.63%)
>
>
>This result is very interesting indeed. So with the much more realistic
>nginx/wrk workload it seems the copy hurts much less than
>iperf3/sockperf would suggest, while SMC-D itself seems to help more.
>I'd hope that this translates to actual applications as well. Maybe this
>makes SMC-D based loopback interesting even while keeping the copy, at
>least until we can come up with a sane way to work a no-copy variant
>into SMC-D?

Yes, SMC-D based loopback shows great advantages over TCP loopback, with
or without the copy.

The advantage of zero-copy should be observed when we need to transfer a
large amount of data. But in this wrk/nginx case, the test file
transferred from server to client is small, so we don't see much gain
from avoiding the copy. If we used a large file (e.g. >= 1MB), I think we
would observe a much different result.

Thanks!
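
P.S. To make the copy vs. no-copy difference a bit more concrete, here is
a minimal, purely illustrative C sketch. It is not the actual patch code:
apart from the attach_dmb/detach_dmb idea itself, every identifier below
(lo_dmb, lo_move_data, lo_attach_dmb, lo_detach_dmb) is made up for this
sketch, and the real smcd_ops signatures in the patches may differ.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical stand-in for a DMB registered with the loopback device. */
    struct lo_dmb {
            uint64_t dmb_tok;   /* token identifying the receiver's RMB   */
            void *cpu_addr;     /* memory backing that RMB                */
            size_t dmb_len;
            int refcnt;         /* owner plus any attached sender         */
    };

    /*
     * Copy path: what a real ISM device must do between LPARs, since no
     * MMU-like layer spans the partitions. Data is explicitly copied from
     * the sender's sndbuf into the receiver's RMB.
     */
    static void lo_move_data(struct lo_dmb *rmb, size_t offset,
                             const void *sndbuf, size_t len)
    {
            memcpy((char *)rmb->cpu_addr + offset, sndbuf, len);
    }

    /*
     * No-copy path: the loopback device lets the sender "attach" to the
     * receiver's DMB and use its memory as the local sndbuf, so whatever
     * is written into the sndbuf is already in the receiver's RMB.
     */
    static void *lo_attach_dmb(struct lo_dmb *rmb)
    {
            rmb->refcnt++;
            return rmb->cpu_addr;   /* used as the sender's sndbuf */
    }

    static void lo_detach_dmb(struct lo_dmb *rmb)
    {
            rmb->refcnt--;          /* RMB is only freed once nobody uses it */
    }

The refcount is the important part of the attach/detach pair: once a
sender's sndbuf aliases the receiver's RMB, that RMB must not be freed
until the sender has detached again.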