On 2/23/22 14:57, Gao Xiang wrote:
> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
>> I'd like to discuss an interface to implement user space block devices,
>> while avoiding local network NBD solutions.  There has been reiterated
>> interest in the topic, both from researchers [1] and from the community,
>> including a proposed session in LSFMM2018 [2] (though I don't think it
>> happened).
>>
>> I've been working on top of the Google iblock implementation to find
>> something upstreamable and would like to present my design and gather
>> feedback on some points, in particular zero-copy and overall user space
>> interface.
>>
>> The design I'm leaning towards uses special fds opened by the driver to
>> transfer data to/from the block driver, preferably through direct
>> splicing as much as possible, to keep data only in kernel space.  This
>> is because, in my use case, the driver usually only manipulates
>> metadata, while data is forwarded directly through the network, or
>> similar.  It would be neat if we can leverage the existing
>> splice/copy_file_range syscalls such that we don't ever need to bring
>> disk data to user space, if we can avoid it.  I've also experimented
>> with regular pipes, but I found no way around keeping a lot of pipes
>> opened, one for each possible command 'slot'.
>>
>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>
> I'm interested in this general topic too. One of our use cases is
> that we need to process network data to some degree, since many
> protocols are application layer protocols, so it seems more reasonable
> to process such protocols in userspace. Another difference is that
> we may have thousands of devices in a machine, since we'd like to run
> as many containers as possible, so the block device solution seems
> suboptimal to us. Yet I'm still interested in this topic to get more
> ideas.
>
> Btw, as for general userspace block device solutions, IMHO, there could
> be some deadlock issues arising from direct reclaim, writeback, and the
> userspace implementation, because writeback user requests can trip back
> to the kernel side (even when the dependency crosses threads). I think
> they are somewhat hard to fix with user block device solutions. For
> example,
> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@xxxxxxxxxxxxxx

This is already fixed with prctl() support. See:

https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@xxxxxxxxxx/

-- 
Damien Le Moal
Western Digital Research
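
For reference, a minimal sketch of how a userspace block daemon might opt in,
assuming the prctl() support mentioned above is the PR_SET_IO_FLUSHER
operation (Linux 5.6 and later, needs CAP_SYS_RESOURCE). The helper names
mark_io_flusher() and is_io_flusher() are hypothetical and exist only for
illustration. The fallback constants are the values from the kernel's
include/uapi/linux/prctl.h, for builds against older userspace headers.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers; the values match
 * include/uapi/linux/prctl.h in Linux 5.6 and later. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/* Hypothetical helper: put the caller in the IO_FLUSHER state so that
 * memory allocations made while servicing block I/O do not recurse into
 * I/O-issuing reclaim. Returns 0 on success or a negative errno
 * (e.g. -EPERM without CAP_SYS_RESOURCE, -EINVAL on older kernels). */
int mark_io_flusher(void)
{
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return -errno;
    return 0;
}

/* Hypothetical helper: query the current state. Returns 1 if the caller
 * is in the IO_FLUSHER state, 0 if not, or a negative errno. */
int is_io_flusher(void)
{
    int ret = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);

    if (ret == -1)
        return -errno;
    return ret;
}
```

A daemon would presumably call mark_io_flusher() early, before it starts
handling requests, and treat a negative return as "not protected against
the reclaim deadlock" rather than as a fatal error.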