On 04.08.22 01:41, Stephen Hemminger wrote:
> On Wed, 3 Aug 2022 16:27:54 -0400
> Matthew Rosato <mjrosato@xxxxxxxxxxxxx> wrote:
>
>> On 7/20/22 1:00 PM, Tony Lu wrote:
>>> Hi all,
>>>
>>> # Background
>>>
>>> We (Alibaba Cloud) already use SMC in our cloud environment to
>>> transparently accelerate TCP applications with ERDMA [1]. A common
>>> scenario nowadays is to deploy containers (whose runtime is based on
>>> lightweight virtual machines) on ECS (Elastic Compute Service), and
>>> such containers may be scheduled on the same host to get higher
>>> network performance, for example for AI, big data or other workloads
>>> that are sensitive to bandwidth and latency. Currently, inter-VM
>>> performance is poor and CPU resources are wasted (see the virtio
>>> numbers in #Benchmark). This scenario has been discussed many times,
>>> but a solution that covers the common application case is still
>>> missing [2] [3] [4].
>>>
>>> # Design
>>>
>>> For the inter-VM scenario we use ivshmem (Inter-VM shared memory
>>> device), which is modeled by QEMU [5]. With it, multiple VMs can
>>> access one shared memory region. This shared memory device is
>>> statically created by the host and shared with the desired guests.
>>> The device is exposed as a PCI BAR and can interrupt its peers
>>> (ivshmem-doorbell).
>>>
>>> In order to use ivshmem in SMC, we wrote a draft device driver as a
>>> bridge between SMC and the ivshmem PCI device. To keep things simple,
>>> this driver acts like an SMC-D device so that it fits into SMC without
>>> modifying the existing code; it is named ivpci (see patch #1).
>>>
>>> ┌───────────────────────────────────────┐
>>> │ ┌───────────────┐   ┌───────────────┐ │
>>> │ │      VM1      │   │      VM2      │ │
>>> │ │┌─────────────┐│   │┌─────────────┐│ │
>>> │ ││ Application ││   ││ Application ││ │
>>> │ │├─────────────┤│   │├─────────────┤│ │
>>> │ ││     SMC     ││   ││     SMC     ││ │
>>> │ │├─────────────┤│   │├─────────────┤│ │
>>> │ ││    ivpci    ││   ││    ivpci    ││ │
>>> │ └└─────────────┘┘   └└─────────────┘┘ │
>>> │     x       *               x   *     │
>>> │     x       ****************x*  *     │
>>> │     x    xxxxxxxxxxxxxxxxxxxx*  *     │
>>> │     x    x                   *  *     │
>>> │ ┌───────────────┐   ┌───────────────┐ │
>>> │ │shared memories│   │ivshmem-server │ │
>>> │ └───────────────┘   └───────────────┘ │
>>> │                HOST A                 │
>>> └───────────────────────────────────────┘
>>>  *********** Control flow (interrupt)
>>>  xxxxxxxxxxx Data flow (memory access)
>>>
>>> Inside the ivpci driver, almost all of the SMC-D device operations are
>>> implemented. It can be divided into two parts:
>>>
>>> - control flow: most of it is the same as for SMC-D; ivshmem is used
>>>   to trigger interrupts in ivpci and to drive the CDC flow.
>>>
>>> - data flow: the shared memory of each connection is one large region
>>>   divided into two parts, the local and the remote RMB. Every write
>>>   syscall copies data into the sndbuf and calls the ISM move_data()
>>>   operation to move it to the remote RMB in ivshmem and interrupt the
>>>   peer. The reader then receives the interrupt, checks the CDC message
>>>   and consumes the data if the cursor has been updated.
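
As a side note, the data path described above maps quite directly onto the
register layout in the ivshmem spec [5]. The following is only an
illustrative sketch of that mapping, not code from the RFC patch: all names
(ivpci_dev, ivpci_move_data, ivpci_map_bars) and the fixed use of MSI-X
vector 0 are assumptions made here for the example. A real driver would
hook the copy into the SMC-D move_data() operation and drive the CDC
handling from the doorbell interrupt, as described above.

/*
 * Illustrative only: map the ivshmem BARs, move one chunk of data into
 * the peer's RMB region, then ring the peer's doorbell.
 */
#include <linux/io.h>
#include <linux/pci.h>

/* BAR0 register offsets, taken from docs/specs/ivshmem-spec.txt [5] */
#define IVSHMEM_INTR_MASK	0x00	/* used with legacy INTx only */
#define IVSHMEM_INTR_STATUS	0x04
#define IVSHMEM_IV_POSITION	0x08	/* our own peer ID, read-only */
#define IVSHMEM_DOORBELL	0x0c	/* write (peer ID << 16) | vector */

struct ivpci_dev {
	void __iomem *regs;		/* BAR0: control registers */
	void __iomem *shmem;		/* BAR2: shared memory (both RMBs) */
	resource_size_t shmem_len;
	u32 peer_id;			/* IVPosition of the remote guest */
};

static int ivpci_map_bars(struct pci_dev *pdev, struct ivpci_dev *idev)
{
	idev->regs = pci_ioremap_bar(pdev, 0);
	idev->shmem = pci_ioremap_bar(pdev, 2);
	if (!idev->regs || !idev->shmem)
		return -ENOMEM;
	idev->shmem_len = pci_resource_len(pdev, 2);
	return 0;
}

/*
 * Copy one chunk from the local sndbuf into the peer's RMB inside the
 * shared region and raise an interrupt -- roughly what the move_data()
 * callback of such a device would do.
 */
static int ivpci_move_data(struct ivpci_dev *idev, unsigned long rmb_offset,
			   const void *data, unsigned int len)
{
	if (rmb_offset + len > idev->shmem_len)
		return -EINVAL;

	memcpy_toio(idev->shmem + rmb_offset, data, len);

	/* Ring the peer's doorbell on vector 0 so it re-reads the CDC
	 * cursors and consumes the new data. */
	writel((idev->peer_id << 16) | 0, idev->regs + IVSHMEM_DOORBELL);
	return 0;
}
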
>>> # Benchmark
>>>
>>> The current POC of ivpci is unstable and only works for a single SMC
>>> connection. Here is the brief data:
>>>
>>> Items           Latency (pingpong)   Throughput (64KB)
>>> TCP (virtio)    19.3 us               3794.185 MBps
>>> TCP (SR-IOV)    13.2 us               3948.792 MBps
>>> SMC (ivshmem)    6.3 us              11900.269 MBps
>>>
>>> Test environment:
>>>
>>> - CPU: Intel Xeon Platinum, 8 cores; memory: 32 GiB
>>> - NIC: Mellanox CX4 with 2 VFs in two different guests
>>> - virtio-net + vhost set up via virsh
>>> - sockperf with a single connection
>>> - The SMC + ivshmem throughput numbers use one copy (userspace ->
>>>   kernel) with an intrusive modification of SMC (see patch #1); the
>>>   latency (pingpong) numbers use two copies (user -> kernel plus the
>>>   move_data() copy, i.e. the patch as posted).
>>>
>>> In this comparison, SMC with ivshmem achieves 3-4x the bandwidth at
>>> half the latency.
>>>
>>> TCP + virtio is the most common solution for guests, but its
>>> performance is lower. Moreover, it consumes an extra thread on the
>>> host that fully occupies a CPU core to transfer data, wasting CPU
>>> resources. If the host is very busy, the performance gets even worse.
>>>
>>
>> Hi Tony,
>>
>> Quite interesting! FWIW for s390x we are also looking at passthrough of
>> host ISM devices to enable SMC-D in QEMU guests:
>> https://lore.kernel.org/kvm/20220606203325.110625-1-mjrosato@xxxxxxxxxxxxx/
>> https://lore.kernel.org/kvm/20220606203614.110928-1-mjrosato@xxxxxxxxxxxxx/
>>
>> But it seems to me an 'emulated ISM' of sorts could still be
>> interesting even on s390x, e.g. for scenarios where host device
>> passthrough is not possible/desired.
>>
>> Out of curiosity I tried this ivpci module on s390x, but the device
>> won't probe -- this is possibly an issue with the s390x PCI emulation
>> layer in QEMU, I'll have to look into that.
>>
>>> # Discussion
>>>
>>> This RFC and solution are still at an early stage, so we want to bring
>>> them up as soon as possible and discuss them fully with IBM and the
>>> community. We have some topics to put on the table:
>>>
>>> 1. Should SMC officially support this scenario?
>>>
>>> SMC + ivshmem shows a huge improvement for inter-VM communication.
>>> SMC-D with a mocked ISM device might not be the right official
>>> solution; maybe another extension of SMC besides SMC-R and SMC-D is
>>> needed. So we are wondering whether SMC would accept this idea to
>>> cover this scenario. Are there any other possibilities?
>>
>> I am curious about ivshmem and its current state though -- e.g. looking
>> around I see mention of v2, which you also referenced, but I don't see
>> any activity on it for a few years? And as far as v1 ivshmem goes, the
>> server is "not for production use", etc.
>>
>> Thanks,
>> Matt
>>
>>>
>>> 2. Implementation of SMC for inter-VM communication.
>>>
>>> SMC is used in container and cloud environments; maybe we can propose
>>> a new device and a new protocol, if possible, to solve this problem in
>>> these new scenarios.
>>>
>>> 3. Standardization of this new protocol and device.
>>>
>>> SMC-R has an open RFC, RFC 7609, so could this new device or protocol,
>>> which is similar to SMC-D, be standardized as well? One possible
>>> option is to propose a new device model in the QEMU + virtio ecosystem
>>> and have SMC support that standard virtio device, like [6].
>>>
>>> If there are any problems, please point them out.
>>>
>>> Hope to hear from you, thank you.
>>>
>>> [1] https://lwn.net/Articles/879373/
>>> [2] https://projectacrn.github.io/latest/tutorials/enable_ivshmem.html
>>> [3] https://dl.acm.org/doi/10.1145/2847562
>>> [4] https://hal.archives-ouvertes.fr/hal-00368622/document
>>> [5] https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt
>>> [6] https://github.com/siemens/jailhouse/blob/master/Documentation/ivshmem-v2-specification.md
>>>
>>> Signed-off-by: Tony Lu <tonylu@xxxxxxxxxxxxxxxxx>
>
> Also looks a lot like the existing VSOCK, which already has transports
> for virtio, Hyper-V and VMware.

To have it documented in this thread: as Wenjia Zhang <wenjia@xxxxxxxxxxxxx>
mentioned in
https://lore.kernel.org/netdev/Yt9Xfv0bN0AGMdGP@TonyMac-Alibaba/t/#mcfaa50f7142f923d2b570dc19b70c73ceddc1270
we are working on some patches to clean up the interface between the ISM
device driver and the SMC-D protocol layer. They may simplify a project
like the one described in this RFC. Stay tuned.