On 2025-02-04 18:27, Ming Lei wrote:
> Hello David,
>
> On Thu, Jan 30, 2025 at 01:28:55PM -0800, David Wei wrote:
>> Hi folks, I want to propose a discussion on adding zero copy to FUSE
>> io_uring in the kernel. The source is some userspace buffer or device
>> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
>> will then either forward it over the network or to an underlying
>> FS/block device. The FUSE server may want to read the data.
>>
>> My goal is to eliminate copies in this entire data path, including the
>> initial hop between the userspace client and the kernel. I know Ming and
>> Keith are working on adding ublk zero copy but it does not cover this
>> initial hop and it does not allow the ublk/FUSE server to read the data.
>
> Not sure get the point, it depends on if the kernel buffer is initialized,
> and you can't read data from one uninitialized kernel buffer.
>
> But if it is userspace or device buffer, the limit may be relaxed.

When a client does a DIO write() to a FUSE filefd, the pages are pinned
by the kernel and then passed to FUSE kernel. It is possible to then
send these to the FUSE server, but it cannot read the data, only pass it
onwards.

>
>>
>> My idea is to use shared memory or dma-buf, i.e. the source data is
>> encapsulated in an mmap()able fd. The client and FUSE server exchange
>> this fd through a back channel with no kernel involvement. The FUSE
>> server maps the fd into its address space and registers the fd with
>
> This approach need client code modification, which isn't generic and
> can't cover existed posix applications.

Yes, the fd exchange is not POSIX. But we could encode the API using say
io_uring cmd if it is seen to be generically useful.

>
> There could be too many client processes, does this way really scale?

For zero copy there is a cutover point where it performs better than
copying. The trade off is between memcpy and the overheads of setting up
zero copy.

In this case, the client is required to be long lived and ideally the
same shmfd is shared across multiple transactions. So the overhead is
paid once and then amortised over multiple transactions.

>
>> io_uring via the io_uring_register() infra. When the client does e.g. a
>> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
>
> BTW, fuse supports write zero copy already, just read zero copy isn't
> supported.

Could you clarify exactly which direction and how much of the data path
"zero copy" covers?

>
>> a lookup and understands that the pages belong to the fd that was
>> registered from the FUSE server. Then io_uring tells the FUSE server
>> that the data is in the fd it registered, so there is no need to copy
>> anything at all.
>>
>> I would like to discuss this and get feedback from the community. My top
>> question is why do this in the kernel at all? It is entirely possible to
>> bypass the kernel entirely by having the client and FUSE server exchange
>> the fd and then do the I/O purely through IPC.
>
> IMO, client code modification may not be accepted for existed applications.

That's up to userspace. I don't think we need to limit ourselves to "no
userspace code changes" or "POSIX only".

>
>
> Thanks,
> Ming
>