Ming Lei <ming.lei@xxxxxxxxxx> writes:

> IMO it needn't an 'inverse io_uring'; the normal io_uring SQE/CQE model
> covers this case. The userspace part can submit SQEs beforehand to get a
> notification for each incoming io request from the kernel driver; then,
> once an io request is queued to the driver, the driver can post a CQE
> for a previously submitted SQE. The recently posted IORING_OP_URING_CMD
> patch [1] is perfect for this purpose.
>
> I have written one such userspace block driver recently: [2] is the
> kernel-side blk-mq driver (the ubd driver), and the userspace part is
> ubdsrv [3]. Both parts look quite simple but are still at a very early
> stage; so far only the ubd-loop and ubd-null targets are implemented in
> [3]. Not only is the io command communication channel done via
> IORING_OP_URING_CMD, the IO handling for ubd-loop is implemented via
> plain io_uring too.
>
> It is basically working. For ubd-loop I see no regression in
> 'xfstests -g auto' on the ubd block device compared with the same
> xfstests on the underlying disk, and my simple performance test in a VM
> shows the result isn't worse than the kernel loop driver with dio, and
> is even much better in some test situations.

Thanks for sharing. This is a very interesting implementation that seems
to cover the original use case quite well. I'm giving it a try and will
report back. (A rough liburing sketch of the command channel, as I
understand it, is appended at the end of this mail.)

> Wrt. this userspace block driver, I am more interested in the following
> sub-topics:
>
> 1) zero copy
>
> - the ubd driver [2] needs one data copy: for a WRITE request, pages in
>   the io request are copied to the userspace buffer before ubdsrv
>   handles the WRITE; for a READ request, the reverse copy is done after
>   ubdsrv handles the READ
>
> - I tried to apply zero copy via remap_pfn_range() to avoid this data
>   copy, but it doesn't work for the ubd driver, since pages in the
>   remapped vm area can't be retrieved by get_user_pages_*(), which is
>   called in the direct io code path
>
> - recently Xiaoguang Wang posted an RFC patch [4] to support zero copy
>   on tcmu, adding vm_insert_page(s)_mkspecial() for that purpose, but it
>   has the same limitation as remap_pfn_range; Xiaoguang also mentioned
>   that vm_insert_pages may work, but anonymous pages can not be remapped
>   by vm_insert_pages
>
> - so the requirement here is to remap either anonymous pages or page
>   cache pages into userspace vm, with the mapping/unmapping done at
>   per-IO runtime. Is this requirement reasonable? If yes, is there any
>   easy way to implement it in the kernel?

I've run into the same issue with my fd implementation and haven't been
able to work around it (the remap_pfn_range sketch at the end shows
where it breaks for me).

> 4) apply eBPF in the userspace block driver
>
> - this is an open topic; I don't have a specific or exact idea yet
>
> - is there a chance to apply eBPF to map ubd io into its target handling,
>   avoiding the data copy and the remapping cost of zero copy?

I was thinking of something like this, or having a way for the server to
only operate on the fds and do splice/sendfile (roughly the splice
sketch at the end). But I don't know if it would be useful for many use
cases. We also want to be able to send the data to userspace, for
instance for userspace networking.

--
Gabriel Krisman Bertazi
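
Sketches mentioned above (illustration only, untested):

For the command channel, this is a minimal userspace sketch of the
"submit SQEs beforehand, get one CQE per incoming request" model using
liburing. UBD_CMD_FETCH_REQ and struct ubd_io_cmd are placeholders I
made up, not the actual ubd uapi; sqe->cmd_op and sqe->cmd follow the
layout proposed in the IORING_OP_URING_CMD series, so names may shift
as that series evolves.

/* Sketch only: placeholder command code and payload, not the ubd uapi. */
#include <stdint.h>
#include <linux/types.h>
#include <liburing.h>

#define UBD_CMD_FETCH_REQ	0x01		/* hypothetical command code */

struct ubd_io_cmd {				/* hypothetical payload; must fit the
						 * 16-byte cmd area unless the ring is
						 * set up with IORING_SETUP_SQE128 */
	__u16	q_id;
	__u16	tag;
	__u32	pad;
	__u64	buf_addr;			/* server-side data buffer */
};

/*
 * Prepost one "fetch" SQE for a given tag.  When the kernel driver gets
 * an io request for this tag, it completes the command and the server
 * sees a CQE describing the request.
 */
static void queue_fetch_sqe(struct io_uring *ring, int ubd_char_fd,
			    unsigned tag, void *buf)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ubd_io_cmd *cmd = (struct ubd_io_cmd *)sqe->cmd;

	io_uring_prep_nop(sqe);			/* clears most SQE fields */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = ubd_char_fd;			/* ubd char device */
	sqe->cmd_op = UBD_CMD_FETCH_REQ;
	sqe->user_data = tag;

	cmd->q_id = 0;
	cmd->tag = tag;
	cmd->buf_addr = (__u64)(uintptr_t)buf;
}

/*
 * Per-queue loop: prepost one SQE per tag, then wait; each CQE is one
 * incoming block request, and after handling it a new fetch SQE is
 * queued for the same tag.
 */
static void io_loop(struct io_uring *ring, int ubd_char_fd,
		    unsigned queue_depth, void **bufs)
{
	struct io_uring_cqe *cqe;
	unsigned tag;

	for (tag = 0; tag < queue_depth; tag++)
		queue_fetch_sqe(ring, ubd_char_fd, tag, bufs[tag]);
	io_uring_submit(ring);

	for (;;) {
		if (io_uring_wait_cqe(ring, &cqe))
			break;
		tag = (unsigned)cqe->user_data;
		/* ... handle the block request for 'tag' here ... */
		io_uring_cqe_seen(ring, cqe);
		queue_fetch_sqe(ring, ubd_char_fd, tag, bufs[tag]);
		io_uring_submit(ring);
	}
}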
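
On the zero-copy point, this is my reading of why the remap_pfn_range()
route breaks the direct io path, as a kernel-side sketch (not the actual
ubd code): remap_pfn_range() marks the vma VM_IO | VM_PFNMAP, and
get_user_pages_*() refuses to pin pages in such a vma, which is exactly
what the dio path needs to do with the server's remapped buffer.

/* Sketch only, not the real ubd code. */
#include <linux/mm.h>

/*
 * Map one page of an in-flight request at 'uaddr' in the server's vma.
 * The mapping itself works, but remap_pfn_range() sets VM_IO | VM_PFNMAP
 * on the vma, and a later get_user_pages_*() on this range -- which is
 * what the direct IO path does with the server's buffer -- fails for
 * such vmas.  Hence the data copy in the current ubd driver.
 */
static int ubd_remap_req_page(struct vm_area_struct *vma,
			      unsigned long uaddr, struct page *req_page)
{
	return remap_pfn_range(vma, uaddr, page_to_pfn(req_page),
			       PAGE_SIZE, vma->vm_page_prot);
}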
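
And on the fd-only idea, this is the kind of data path I had in mind,
sketched with plain splice(2) between two fds through a pipe so the
payload never lands in a userspace buffer. Whether and how the ubd char
device could expose per-request data as a spliceable fd is precisely the
open question; copy_fd_to_fd() below only illustrates the userspace side.

/* Illustration only: move 'len' bytes from src_fd to dst_fd via a pipe,
 * without ever pulling the data into a userspace buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

static int copy_fd_to_fd(int src_fd, int dst_fd, size_t len)
{
	int pipefd[2];
	int ret = -1;

	if (pipe(pipefd) < 0)
		return -1;

	while (len > 0) {
		/* pull data from the source into the pipe */
		ssize_t in = splice(src_fd, NULL, pipefd[1], NULL, len,
				    SPLICE_F_MOVE);
		if (in <= 0)
			goto done;

		/* drain exactly what we pulled into the destination */
		while (in > 0) {
			ssize_t n = splice(pipefd[0], NULL, dst_fd, NULL,
					   in, SPLICE_F_MOVE);
			if (n <= 0)
				goto done;
			in -= n;
			len -= n;
		}
	}
	ret = 0;
done:
	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}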