[RFC] Extension to POSIX API for zero-copy data transfers

Greetings,

I've been thinking about a POSIX-like API that would make read/write/send/recv zero-copy instead of buffered, so that data moves directly between storage devices or network sockets and a user-space application's buffers.

My focus was initially on network stacks and I drew inspiration from DPDK. I'm also aware of some work underway on extending io_uring to support zero copy.

A draft API would work as follows:
* The application fills out an array of iovecs describing buffers in its own memory that can hold data from protocols such as TCP or UDP. These iovecs serve as hints, telling the network stack that it may map part of a frame's contents anywhere within the described buffers. For example, an iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }. In this case the data payload may end up anywhere between 0x4000 and 0xe000, and after the syscall the fields will be overwritten to something like { .iov_base = 0x4036, .iov_len = 1460 }.
* To receive packets, the application calls readv or a readv-like syscall, and its array of iovecs is modified to point to the data payloads. Given that the payloads' pages are mapped directly into user space, some header fields or tail room may have to be zeroed out before mapping, in order to prevent information leaks. Any array of iovecs passed to such a syscall should be sanity-checked: the buffers must be large enough to hold data payloads even in corner cases, must not overlap with each other, and must hold addresses that pages can actually be mapped to.
* The return value would be the number of data payloads that have been populated; only that many leading elements of the provided array end up containing payloads.
* The syscall's prototype would be nearly identical to that of readv, except that iov would be a plain struct iovec * rather than a const struct iovec *, and the return type would change. Like so (a usage sketch follows the prototype):
 int zc_readv(int fd, struct iovec *iov, int iovcnt);
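
To make the receive path concrete, here is a rough usage sketch against the prototype above. zc_readv() does not exist yet; this is only how I imagine an application would drive it, with made-up buffer sizes:

    #include <stdio.h>
    #include <sys/uio.h>

    int zc_readv(int fd, struct iovec *iov, int iovcnt); /* proposed */

    /* Page-aligned scratch memory; the region sizes are arbitrary hints. */
    static char pool[8 * 0xa000] __attribute__((aligned(4096)));

    void receive_example(int sock)
    {
            struct iovec iov[8];
            int i, n;

            /* Hint regions: the stack may place each payload anywhere
             * inside the described buffer. */
            for (i = 0; i < 8; i++) {
                    iov[i].iov_base = pool + i * 0xa000;
                    iov[i].iov_len  = 0xa000;
            }

            n = zc_readv(sock, iov, 8);

            /* Only the first n iovecs were rewritten to point at real
             * payloads, e.g. { .iov_base = pool + 0x36, .iov_len = 1460 }. */
            for (i = 0; i < n; i++)
                    printf("payload %d: %p, %zu bytes\n",
                           i, iov[i].iov_base, iov[i].iov_len);
    }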

* In the case of writes, a plain struct iovec may not suffice: the provided buffers should describe not only the location and size of the data to be sent, but also guarantee that the buffers have sufficient head and tail room. A hackish syscall would look like so:
 int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
* While the first iovec should describe the entire memory area available to a packet, including enough head and tail room for headers, CRCs, or other fields specific to the NIC, the second should describe a sub-buffer that holds the data to be written.
* Again, sanity checks should be performed on the entire array: enough room for the other fields, no overlap, proper alignment, the ability to DMA to the device, etc.
* After calling zc_writev, the pages associated with the provided iovecs are immediately swapped for zero pages to avoid data leaks.
* For writes, arbitrary physical pages may not work for every NIC, as some are bound by 32-bit addressing constraints on the PCIe bus, etc. As such, the application would have to manage a memory pool associated with each file descriptor (and thus, possibly, with each NIC) that contains memory physically located where it can be DMA'd to the proper device. For example, one may mmap the file descriptor to obtain a pool of a certain size for this purpose. A sketch of the resulting send path follows.
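
A rough sketch of that send path, using the two-iovec pairing and the mmap-based pool described above (the frame size, head room, and payload are made-up figures):

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/uio.h>

    int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt); /* proposed */

    int send_example(int sock)
    {
            size_t frame_sz = 2048;

            /* Obtain DMA-able memory tied to this descriptor: mmap() on
             * the socket hands back a pool the NIC is known to reach. */
            char *pool = mmap(NULL, frame_sz, PROT_READ | PROT_WRITE,
                              MAP_SHARED, sock, 0);
            if (pool == MAP_FAILED)
                    return -1;

            /* Place the payload 256 bytes in, leaving head room for
             * headers and tail room for CRCs or NIC-specific fields. */
            memcpy(pool + 256, "hello", 5);

            const struct iovec frames[1][2] = {
                    { { pool, frame_sz },  /* whole frame, incl. head/tail room */
                      { pool + 256, 5 } }  /* sub-buffer holding the data */
            };

            /* After this returns, the pool's pages have been swapped for
             * zero pages, so the buffer must not be reused as-is. */
            return zc_writev(sock, frames, 1);
    }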

This concept can be extended to storage devices. Unfortunately I am unfamiliar with NVMe and SCSI, so I can only guess that they work in a similar manner to NIC rings, in that data can be written to and read from arbitrary physical RAM (as allowed by the IOMMU). Syscalls similar to zc_readv and zc_writev could be used on file descriptors pointing to storage devices to fetch or write the sectors that contain a file's data. Some data should be zeroed out in this case as well, as sectors will more often than not contain data that does not belong to the intended files.
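
If that guess holds, the storage side would look much like the socket side. A rough sketch, with an illustrative path and sector size:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/uio.h>

    int zc_readv(int fd, struct iovec *iov, int iovcnt); /* proposed */

    static char sectors[4 * 4096] __attribute__((aligned(4096)));

    int read_file_example(void)
    {
            struct iovec iov[4];
            int fd, i, n;

            fd = open("/var/data/blob", O_RDONLY); /* illustrative path */
            if (fd < 0)
                    return -1;

            for (i = 0; i < 4; i++) {
                    iov[i].iov_base = sectors + i * 4096;
                    iov[i].iov_len  = 4096;
            }

            /* On return the first n iovecs point at mapped sector
             * contents; bytes in those sectors that belong to other
             * files would have been zeroed by the kernel beforehand. */
            n = zc_readv(fd, iov, 4);
            close(fd);
            return n;
    }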

For example, one could mix such syscalls to read directly from storage into NIC buffers, encrypt the data in place on the way (via TLS), and send it to a client, similar to what Netflix does with in-kernel TLS and sendfile.
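
Putting the pieces together, the flow might look like the sketch below. tls_encrypt_in_place() stands in for whatever in-place record encryption the TLS layer would expose; it is purely a placeholder, as are the sizes:

    #include <sys/mman.h>
    #include <sys/uio.h>

    int zc_readv(int fd, struct iovec *iov, int iovcnt);             /* proposed */
    int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt); /* proposed */
    void tls_encrypt_in_place(void *buf, size_t len);                /* placeholder */

    int serve_example(int file_fd, int sock)
    {
            size_t frame_sz = 2048, head = 256, tail = 64;

            /* Take the buffer from the NIC's pool so the payload pages
             * are already DMA-able by the device. */
            char *pool = mmap(NULL, frame_sz, PROT_READ | PROT_WRITE,
                              MAP_SHARED, sock, 0);
            if (pool == MAP_FAILED)
                    return -1;

            /* Hint: let the file data land inside the frame's payload area. */
            struct iovec iov = { pool + head, frame_sz - head - tail };
            if (zc_readv(file_fd, &iov, 1) < 1) /* storage -> NIC buffer */
                    return -1;

            tls_encrypt_in_place(iov.iov_base, iov.iov_len);

            const struct iovec frame[1][2] = {
                    { { pool, frame_sz },             /* whole frame */
                      { iov.iov_base, iov.iov_len } } /* encrypted payload */
            };
            return zc_writev(sock, frame, 1); /* NIC buffer -> client */
    }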

All the best,
Mihai






