On 8/9/23 7:57 PM, Mina Almasry wrote:
> Changes in RFC v2:
> ------------------
>
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly. This is the approach proposed at a high level here[2].
>
> Detailed changes:
> 1. Replaced the dma-buf pages approach with importing the scatterlist into
>    the page pool.
> 2. Replaced the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>    implementing the TX path with the scatterlist approach, but leaving
>    out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>    perf testing yet. I'm not sure there are regressions, but I removed
>    perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
>
> Any feedback is welcome, but specifically the biggest pending questions
> needing feedback IMO are:
>
> 1. Feedback on the scatterlist-based approach in general.
> 2. Netlink API (Patch 1 & 2).
> 3. Approach to handle all the drivers that expect to receive pages from
>    the page pool (Patch 6).
>
> [1] https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb0b4@xxxxxxxxx/T/
> [2] https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=H7Nw@xxxxxxxxxxxxxx/
>
> ----------------------
>
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
>
> * Problem:
>
> A large number of data transfers have device memory as the source and/or
> destination. Accelerators have drastically increased the volume of such
> transfers. Some examples include:
> - ML accelerators transferring large amounts of training data from storage
>   into GPU/TPU memory. In some cases ML training setup time can be as long
>   as 50% of TPU compute time; improving data transfer throughput and
>   efficiency can help improve GPU/TPU utilization.
>
> - Distributed training, where ML accelerators, such as GPUs on different
>   hosts, exchange data among themselves.
>
> - Distributed raw block storage applications transferring large amounts of
>   data to and from remote SSDs; much of this data does not require host
>   processing.
>
> Today, the majority of Device-to-Device data transfers over the network are
> implemented as the following low-level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
>
> This implementation is suboptimal, especially for bulk data transfers, and
> can put significant strain on system resources such as host memory
> bandwidth, PCIe bandwidth, etc. One important reason behind the current
> state is the kernel's lack of semantics to express device-to-network
> transfers.
>
> * Proposal:
>
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
>
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
>
> Packet _payloads_ go directly from the NIC to device memory for receive and
> from device memory to the NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP
> stack normally.
> The NIC _must_ support header split to achieve this.
>
> Advantages:
>
> - Alleviate host memory bandwidth pressure, compared to existing
>   network-transfer + device-copy semantics.
>
> - Alleviate PCIe bandwidth pressure, by limiting data transfer to the
>   lowest level of the PCIe tree, compared to the traditional path, which
>   sends data through the root complex.
>
> * Patch overview:
>
> ** Part 1: netlink API
>
> Gives the user the ability to bind a dma-buf to an RX queue.
>
> ** Part 2: scatterlist support
>
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, the networking stack (skbs,
> drivers, and the page pool) operates on pages. We have 2 options:
>
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlists.
>
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
>
> ** Part 3: page pool support
>
> We piggyback on the page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
>
> It allows the page pool to define a memory provider that handles the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
>
> ** Part 4: support for unreadable skb frags
>
> Page pool iovs are not accessible by the host; we implement changes
> throughout the networking stack to correctly handle skbs with unreadable
> frags.
>
> ** Part 5: recvmsg() APIs
>
> We define user APIs for the user to send and receive device memory.
>
> Not included with this RFC is the GVE devmem TCP support, just to
> simplify the review. Code available here if desired:
> https://github.com/mina/linux/tree/tcpdevmem
>
> This RFC is built on top of net-next with Jakub's pp-providers changes
> cherry-picked.
>
> * NIC dependencies:
>
> 1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
>    capability to split incoming packets into a header + payload and to put
>    each into a separate buffer. Devmem TCP works by using device memory
>    for the packet payload and host memory for the packet headers.
>
> 2. (optional) Devmem TCP works better with flow steering & RSS support,
>    i.e. the NIC's ability to steer flows into certain rx queues. This
>    allows the sysadmin to enable devmem TCP on a subset of the rx queues,
>    and steer devmem TCP traffic onto these queues and non-devmem TCP
>    traffic elsewhere.
>
> The NIC I have access to with these properties is the GVE with DQO support
> running in Google Cloud, but any NIC that supports these features would
> suffice. I may be able to help reviewers bring up devmem TCP on their NICs.
>
> * Testing:
>
> The series includes a udmabuf kselftest that shows a simple use case of
> devmem TCP and validates the entire data path end to end without a
> dependency on a specific dmabuf provider.
>
> ** Test Setup
>
> Kernel: net-next with this RFC and the memory provider API cherry-picked
> locally.
>
> Hardware: Google Cloud A3 VMs.
>
> NIC: GVE with header split & RSS & flow steering support.

This set seems to depend on Jakub's memory provider patches and a netdev
driver change that is not included. For the testing mentioned here, you
must have a tree + branch with all of the patches. Is it publicly
available?
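For other reviewers trying to picture Parts 2 and 3 without the driver
bits, I assume the core of the binding path follows the standard in-kernel
dma-buf import flow (dma_buf_get / dma_buf_attach / dma_buf_map_attachment)
before the resulting sg_table is chunked up and handed to the page pool
memory provider. A rough sketch of what I mean - not taken from the
patches; devmem_bind_dmabuf() is a placeholder name and the provider
hookup is not shown:

#include <linux/dma-buf.h>
#include <linux/scatterlist.h>
#include <linux/err.h>

/* Placeholder sketch: import a user-supplied dma-buf fd and walk its DMA
 * segments against the NIC's struct device. The part the series actually
 * adds (feeding these chunks to a page pool memory provider for the bound
 * RX queue) is only hinted at in the comment below.
 */
static int devmem_bind_dmabuf(struct device *dev, int dmabuf_fd)
{
	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
	struct scatterlist *sg;
	unsigned int i;

	dmabuf = dma_buf_get(dmabuf_fd);
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	attach = dma_buf_attach(dmabuf, dev);
	if (IS_ERR(attach)) {
		dma_buf_put(dmabuf);
		return PTR_ERR(attach);
	}

	sgt = dma_buf_map_attachment(attach, DMA_FROM_DEVICE);
	if (IS_ERR(sgt)) {
		dma_buf_detach(dmabuf, attach);
		dma_buf_put(dmabuf);
		return PTR_ERR(sgt);
	}

	/* Presumably each DMA segment is split into page-pool-sized chunks
	 * and handed to the memory provider for the bound RX queue here.
	 */
	for_each_sgtable_dma_sg(sgt, sg, i)
		pr_debug("devmem chunk: addr %pad len %u\n",
			 &sg_dma_address(sg), sg_dma_len(sg));

	/* Lifetime/teardown (unmap, detach, put) omitted for brevity. */
	return 0;
}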
It would be interesting to see how well (easy) this integrates with
io_uring. Besides avoiding all of the syscalls for receiving the iov and
releasing the buffers back to the pool, io_uring also brings in the ability
to seed a page_pool with registered buffers, which provides a means to get
simpler Rx ZC for host memory; a minimal sketch of that registration
interface is appended below.

Overall I like the intent and the possibilities for extensions, but a lot
of details are missing - perhaps some are answered by seeing an end-to-end
implementation.
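To make the io_uring comparison concrete, here is a minimal liburing
example of the existing fixed-buffer registration path for host memory.
Nothing here is devmem-specific; seeding a page_pool from such a
registration, as suggested above, would be new kernel work:

#include <liburing.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	void *buf;

	/* One 1 MB page-aligned host buffer to register with the ring. */
	if (posix_memalign(&buf, 4096, 1 << 20))
		return 1;
	iov.iov_base = buf;
	iov.iov_len = 1 << 20;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;

	/* Register the buffer once; subsequent fixed reads/writes
	 * (io_uring_prep_read_fixed() and friends) reference it by index
	 * instead of passing a pointer per operation.
	 */
	if (io_uring_register_buffers(&ring, &iov, 1) < 0)
		return 1;

	io_uring_queue_exit(&ring);
	free(buf);
	return 0;
}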