Greetings,
I've been thinking about a POSIX-like API that would allow
read/write/send/recv to be zero-copy instead of buffered, so that
storage devices and network sockets could transfer data directly to
and from a user-space application's buffers.
My focus was initially on network stacks and I drew inspiration from
DPDK. I'm also aware of some work underway on extending io_uring to
support zero copy.
A draft API would work as follows:
* The application fills out a series of iovec's with buffers in its own
memory that can store data from protocols such as TCP or UDP. These
iovec's serve as hints, telling the network stack where it may map a
part of a frame's contents. For example, an iovec may contain
{ .iov_base = 0x4000, .iov_len = 0xa000 }. In this case, the data
payload may end up anywhere between 0x4000 and 0xe000 and, after the
syscall, the iovec's fields will be overwritten to something like
{ .iov_base = 0x4036, .iov_len = 1460 }
* In order to receive packets, the application calls readv or a
readv-like syscall, and its array of iovec's will be modified to point
to data payloads. Given that their pages will be mapped directly into
user-space, some header fields or tail-room may have to be zeroed out
before being mapped, in order to prevent information leaks. Any array
of iovec's passed to such a syscall should be sanity-checked: each
buffer must be large enough to hold a data payload in corner cases,
the buffers must not overlap each other, and they must hold addresses
that pages can actually be mapped to.
* The return value would be the number of data payloads that have been
populated. Only that many leading elements of the provided array would
end up containing data payloads.
* The syscall's prototype would be nearly identical to that of readv,
except that iov would not be a const struct iovec *, but just a struct
iovec *, and the return type would change. Like so:
int zc_readv(int fd, struct iovec *iov, int iovcnt);
* In the case of writes, a struct iovec may not suffice, as the
provided buffers should not only give the location and size of the data
to be sent, but also guarantee that there is sufficient head and tail
room around it. A hackish syscall would look like so:
int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
* While the first iovec should describe the entire memory area
available to a packet, including enough head and tail room for headers
and CRC's or other fields specific to the NIC, the second should
describe a sub-buffer that holds the data to be written.
* Again, sanity checks should be performed on the entire array, for
things like having enough room for other fields, not overlapping,
proper alignment, ability to DMA to a device, etc.
* After calling zc_writev, the pages associated with the provided
iovec's are immediately swapped for zero-pages to avoid data leaks.
* For writes, arbitrary physical pages may not work for every NIC, as
some are bound by 32-bit addressing constraints on the PCIe bus, etc.
As such, the application would have to manage a memory pool associated
with each file descriptor (possibly a NIC) containing memory that is
physically mapped to areas the proper devices can DMA to. For example,
one may mmap the file descriptor to obtain a pool of a certain size for
this purpose.
This concept can be extended to storage devices; unfortunately I am
unfamiliar with NVMe and SCSI, so I can only guess that they work in a
similar manner to NIC rings, in that data can be read from and written
to arbitrary physical RAM (as allowed by the IOMMU). Syscalls similar
to zc_readv and zc_writev could be used on file descriptors pointing to
storage devices to fetch or write sectors that contain data belonging
to files. Some data should be zeroed out in this case as well, as
sectors will more often than not contain data that does not belong to
the intended files.
For example, one could mix such syscalls to read directly from storage
into NIC buffers, performing in-place encryption on the way (via TLS),
and send the data to a client much like Netflix does with in-kernel TLS
and sendfile.
All the best,
Mihai