On Mon, Mar 28, 2022 at 04:20:03PM -0400, Gabriel Krisman Bertazi wrote:
> Ming Lei <ming.lei@xxxxxxxxxx> writes:
> 
> > IMO it doesn't need an 'inverse io_uring'; the normal io_uring SQE/CQE
> > model already covers this case: the userspace part can submit SQEs
> > beforehand to get a notification for each incoming io request from the
> > kernel driver, then once an io request is queued to the driver, the
> > driver can queue a CQE for the previously submitted SQE. The recently
> > posted IORING_OP_URING_CMD patch[1] is perfect for this purpose.
> > 
> > I have written one such userspace block driver recently: [2] is the
> > kernel-side blk-mq driver (ubd driver), and the userspace part is
> > ubdsrv[3]. Both parts look quite simple, but they are still at a very
> > early stage; so far only the ubd-loop and ubd-null targets are
> > implemented in [3]. Not only is the io command communication channel
> > done via IORING_OP_URING_CMD, but the IO handling for ubd-loop is
> > implemented via plain io_uring as well.
> > 
> > It is basically working: for ubd-loop, I see no regression in
> > 'xfstests -g auto' on the ubd block device compared with the same
> > xfstests on the underlying disk, and my simple performance test in a
> > VM shows the result is no worse than the kernel loop driver with dio,
> > and even much better in some test situations.
> 
> Thanks for sharing. This is a very interesting implementation that
> seems to cover the original use case quite well. I'm giving it a try
> and will report back.
> 
> > Wrt. this userspace block driver, I am particularly interested in the
> > following sub-topics:
> > 
> > 1) zero copy
> > - the ubd driver[2] needs one data copy: for a WRITE request, the
> >   pages in the io request are copied to the userspace buffer before
> >   the WRITE IO is handled by ubdsrv; for a READ request, the reverse
> >   copy is done after the READ request has been handled by ubdsrv
> > 
> > - I tried to implement zero copy via remap_pfn_range() to avoid this
> >   data copy, but it doesn't seem to work for the ubd driver, since
> >   pages in the remapped vm area can't be retrieved by
> >   get_user_pages_*(), which is called in the direct io code path
> > 
> > - recently Xiaoguang Wang posted an RFC patch[4] for supporting zero
> >   copy on tcmu, which adds vm_insert_page(s)_mkspecial() for this
> >   purpose, but it has the same limitation as remap_pfn_range; Xiaoguang
> >   also mentioned that vm_insert_pages may work, but anonymous pages
> >   cannot be remapped by vm_insert_pages
> > 
> > - the requirement here is to remap either anonymous pages or page
> >   cache pages into the userspace vm, with the mapping/unmapping done
> >   at per-IO runtime. Is this requirement reasonable? If yes, is there
> >   any easy way to implement it in the kernel?
> 
> I've run into the same issue with my fd implementation and haven't been
> able to work around it.
> 
> > 4) apply eBPF in userspace block driver
> > - this is an open topic; I don't have a specific or exact idea yet
> > 
> > - is there a chance to apply eBPF for mapping ubd io onto its target
> >   handling, to avoid the data copy and the remapping cost of zero copy?
> 
> I was thinking of something like this, or having a way for the server to
> only operate on the fds and do splice/sendfile. But I don't know if it
> would be useful for many use cases. We also want to be able to send the
> data to userspace, for instance for userspace networking.

I understand that the big point is how to pass the io data to the ubd
driver's request/bio pages.
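
To make the question concrete, the copy that ubd does today for a WRITE
request is roughly like the sketch below (a simplified sketch only, not
the actual ubd code; the function name and the way the userspace buffer
is passed in are just for illustration):

#include <linux/blk-mq.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>

/*
 * Sketch only, not the real ubd code: walk the bvec pages of the blk-mq
 * request and copy them into the userspace buffer that ubdsrv will use
 * for handling this io command.
 */
static int ubd_copy_write_to_user(struct request *rq, void __user *ubuf)
{
	struct req_iterator iter;
	struct bio_vec bv;
	size_t done = 0;

	rq_for_each_segment(bv, rq, iter) {
		void *kaddr = kmap_local_page(bv.bv_page);
		unsigned long left;

		/* this is the extra copy that zero copy wants to remove */
		left = copy_to_user(ubuf + done, kaddr + bv.bv_offset,
				    bv.bv_len);
		kunmap_local(kaddr);
		if (left)
			return -EFAULT;
		done += bv.bv_len;
	}
	return 0;
}

For READ, the same walk runs in the opposite direction (copying from the
user buffer into the request pages) once ubdsrv has handled the io.
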
But splice/sendfile just transfers data between two FDs, so how can the
block request/bio's pages get filled with the expected data? Can you
explain a bit in detail?

If the block layer is bypassed, it won't be exposed as a block disk to
userspace.

thanks,
Ming