On Tue, 2018-01-16 at 06:52 -0800, Matthew Wilcox wrote:
> I see the improvements that Facebook have been making to the nbd driver,
> and I think that's a wonderful thing. Maybe the outcome of this topic
> is simply: "Shut up, Matthew, this is good enough".
>
> It's clear that there's an appetite for userspace block devices; not for
> swap devices or the root device, but for accessing data that's stored
> in that silo over there, and I really don't want to bring that entire
> mess of CORBA / Go / Rust / whatever into the kernel to get to it,
> but it would be really handy to present it as a block device.
>
> I've looked at a few block-driver-in-userspace projects that exist, and
> they all seem pretty bad. For example, one API maps a few gigabytes of
> address space and plays games with vm_insert_page() to put page cache
> pages into the address space of the client process. Of course, the TLB
> flush overhead of that solution is criminal.
>
> I've looked at pipes, and they're not an awful solution. We've almost
> got enough syscalls to treat other objects as pipes. The problem is
> that they're not seekable. So essentially you're looking at having one
> pipe per outstanding command. If you want to make good use of a modern
> NAND device, you want a few hundred outstanding commands, and that's a
> bit of a shoddy interface.
>
> Right now, I'm leaning towards combining these two approaches; adding
> a VM_NOTLB flag so the mmaped bits of the page cache never make it into
> the process's address space, so the TLB shootdown can be safely skipped.
> Then check it in follow_page_mask() and return the appropriate struct
> page. As long as the userspace process does everything using O_DIRECT,
> I think this will work.
>
> It's either that or make pipes seekable ...

How about using the RDMA API and the rdma_rxe driver over loopback? The RDMA
API supports zero-copy communication, which is something the BSD socket API
does not support.
The RDMA API also supports byte-level granularity, and the hot path
(ib_post_send(), ib_post_recv(), ib_poll_cq()) does not require any system
calls for PCIe RDMA adapters. The rdma_rxe driver, however, uses a system call
to trigger the send doorbell.

Bart.