On 2/23/22 17:11, Gao Xiang wrote:
> On Wed, Feb 23, 2022 at 04:46:41PM +0900, Damien Le Moal wrote:
>> On 2/23/22 14:57, Gao Xiang wrote:
>>> On Mon, Feb 21, 2022 at 02:59:48PM -0500, Gabriel Krisman Bertazi wrote:
>>>> I'd like to discuss an interface to implement user space block devices,
>>>> while avoiding local network NBD solutions. There has been reiterated
>>>> interest in the topic, both from researchers [1] and from the community,
>>>> including a proposed session in LSFMM2018 [2] (though I don't think it
>>>> happened).
>>>>
>>>> I've been working on top of the Google iblock implementation to find
>>>> something upstreamable and would like to present my design and gather
>>>> feedback on some points, in particular zero-copy and overall user space
>>>> interface.
>>>>
>>>> The design I'm pending towards uses special fds opened by the driver to
>>>> transfer data to/from the block driver, preferably through direct
>>>> splicing as much as possible, to keep data only in kernel space. This
>>>> is because, in my use case, the driver usually only manipulates
>>>> metadata, while data is forwarded directly through the network, or
>>>> similar. It would be neat if we can leverage the existing
>>>> splice/copy_file_range syscalls such that we don't ever need to bring
>>>> disk data to user space, if we can avoid it. I've also experimented
>>>> with regular pipes, But I found no way around keeping a lot of pipes
>>>> opened, one for each possible command 'slot'.
>>>>
>>>> [1] https://dl.acm.org/doi/10.1145/3456727.3463768
>>>> [2] https://www.spinics.net/lists/linux-fsdevel/msg120674.html
>>>
>>> I'm interested in this general topic too. One of our use cases is
>>> that we need to process network data in some degree since many
>>> protocols are application layer protocols so it seems more reasonable
>>> to process such protocols in userspace. And another difference is that
>>> we may have thousands of devices in a machine since we'd better to run
>>> containers as many as possible so the block device solution seems
>>> suboptimal to us. Yet I'm still interested in this topic to get more
>>> ideas.
>>>
>>> Btw, As for general userspace block device solutions, IMHO, there could
>>> be some deadlock issues out of direct reclaim, writeback, and userspace
>>> implementation due to writeback user requests can be tripped back to
>>> the kernel side (even the dependency crosses threads). I think they are
>>> somewhat hard to fix with user block device solutions. For example,
>>> https://lore.kernel.org/r/CAM1OiDPxh0B1sXkyGCSTEpdgDd196-ftzLE-ocnM8Jd2F9w7AA@xxxxxxxxxxxxxx
>>
>> This is already fixed with prctl() support. See:
>>
>> https://lore.kernel.org/linux-fsdevel/20191112001900.9206-1-mchristi@xxxxxxxxxx/
>
> As I mentioned above, IMHO, we could add some per-task state to avoid
> the majority of such deadlock cases (also what I mentioned above), but
> there may still some potential dependency could happen between threads,
> such as using another kernel workqueue and waiting on it (in principle
> at least) since userspace program can call any syscall in principle (
> which doesn't like in-kernel drivers). So I think it can cause some
> risk due to generic userspace block device restriction, please kindly
> correct me if I'm wrong.

Not sure what you mean with all this. prctl() works per process/thread
and a context that has PR_SET_IO_FLUSHER set will have PF_MEMALLOC_NOIO
set. So for the case of a user block device driver, setting this means
that it cannot reenter itself during a memory allocation, regardless of
the system call it executes (FS etc): all memory allocations in any
syscall executed by the context will have GFP_NOIO.

If the kernel-side driver for the user block device driver does any
allocation that does not have GFP_NOIO, or cause any such allocation
(e.g. within a workqueue it is waiting for), then that is a kernel bug.
Block device drivers are not supposed to ever do a memory allocation in
the IO hot path without GFP_NOIO.

>
> Thanks,
> Gao Xiang
>
>>
>>
>> --
>> Damien Le Moal
>> Western Digital Research

--
Damien Le Moal
Western Digital Research
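
To make the PR_SET_IO_FLUSHER point above concrete, here is a minimal
sketch of how a user space block device daemon might mark each of its IO
threads. The helper names mark_io_flusher() and is_io_flusher() are
hypothetical, not from any existing driver. The sketch assumes Linux 5.6
or later and a thread holding CAP_SYS_RESOURCE; the fallback defines are
only there for older userspace headers.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values match linux/prctl.h. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif
#ifndef PR_GET_IO_FLUSHER
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Hypothetical helper: mark the calling thread as an IO flusher so that
 * memory allocations made on its behalf inside any syscall are treated
 * as GFP_NOIO. Returns 0 on success or a negative errno value, e.g.
 * -EPERM without CAP_SYS_RESOURCE or -EINVAL on a kernel that does not
 * know this prctl option.
 */
static int mark_io_flusher(void)
{
    /* Unused arguments must be zero, otherwise the kernel returns EINVAL. */
    if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
        return -errno;
    return 0;
}

/*
 * Hypothetical helper: query the flag for the calling thread.
 * Returns 1 if set, 0 if not set, or a negative errno value on error.
 */
static int is_io_flusher(void)
{
    int ret = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);

    return ret == -1 ? -errno : ret;
}
```

Since the flag is per thread, a daemon would presumably call
mark_io_flusher() at the start of every thread that services block
requests, before it handles the first one, and treat a failure as fatal
rather than running unprotected.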