On 2022/5/3 16:02, Ming Lei wrote:
> Hello Gabriel,
>
> CC linux-block and hope you don't mind, :-)
>
> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
>>
>> Hi Ming,
>>
>> First of all, I hope I didn't put you on the spot too much during the
>> discussion. My original proposal was to propose my design, but your
>> implementation quite solved the questions I had. :)
>
> I think that is open source, then we can put efforts together to make things
> better.
>
>>
>> I'd like to follow up to restart the communication and see
>> where I can help more with UBD. As I said during the talk, I've
>> done some fio runs and I was able to saturate NBD much faster than UBD:
>>
>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
>
> Yeah, that is true since NBD has extra socket communication cost which
> can't be efficient as io_uring.
>
>>
>> I've also wrote some fixes to the initialization path, which I
>> planned to send to you as soon as you published your code, but I think
>> you might want to take a look already and see if you want to just squash
>> it into your code base.
>>
>> I pushed those fixes here:
>>
>> https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
>
> I have added the 1st fix and 3rd patch into my tree:
>
> https://github.com/ming1/linux/commits/v5.17-ubd-dev
>
> The added check in 2nd patch is done lockless, which may not be reliable
> enough, so I didn't add it. Also adding device is in slow path, and no
> necessary to improve in that code path.
>
> I also cleaned up ubd driver a bit: debug code cleanup, remove zero copy
> code, remove command of UBD_IO_GET_DATA and always make ubd driver
> builtin.
>
> ubdsrv part has been cleaned up too:
>
> https://github.com/ming1/ubdsrv
>
>>
>> I'm looking into adding support for multiple driver queues next, and
>> should be able to share some patches on that shortly.
>
> OK, please post them on linux-block so that more eyes can look at the
> code, meantime the ubdsrv side needs to handle MQ too.
>
> Sooner or later, the single ubdsrv task may be saturated by copying data and
> io_uring command communication only, which can be shown by running io on
> ubd-null target. In my lattop, the ubdsrv cpu utilization is close to
> 90% when IOPS is > 500K. So MQ may help some fast backing cases.
>
>
> Thanks,
> Ming

Hi Ming,

I am currently studying your userspace block driver (UBD) [1][2], and we
plan to replace TCMU with UBD as a new way of implementing userspace block
devices, because of its high performance and simplicity.

First, we have run some fio and perf tests to evaluate UBD.

1) UBD achieves higher throughput than TCMU. We think TCMU suffers from the
complicated SCSI layer and its lack of multiqueue support, while UBD simply
uses io_uring passthrough commands and may support multiqueue in the future.
(Note that even with a single queue, UBD already outperforms TCMU.)

2) Some functions in UBD show high CPU utilization, and we suspect they also
lower throughput. For example, ubdsrv_submit_fetch_commands() repeatedly
iterates over the array of UBD IOs and wastes CPU when no IO is ready to be
submitted. Besides, ubd_copy_pages() makes the CPU copy data between bio
vectors and UBD's internal buffers while handling read and write requests;
this copy could be eliminated by supporting zero-copy.

Second, I'd like to share some ideas about UBD. I'm not sure whether they
are all reasonable, so please point out my mistakes.

1) UBD issues one sqe to commit the last completed request and fetch a new
one. Then, blk-mq's queue_rq() issues a new UBD IO request and completes one
cqe for that fetch command. We have measured that io_submit_sqes() costs
some CPU, and the steps needed to build a new sqe may lower throughput. Here
I'd like to suggest a different approach: never submit an sqe again, but
instead generate a cqe (carrying the information of the new UBD IO request)
directly when queue_rq() is called. I am inspired by the io_uring flag
IORING_POLL_ADD_MULTI, with which userspace issues only one sqe to poll an
fd and then repeatedly receives cqes as new events occur (a rough liburing
sketch of this "one sqe, many cqes" model follows below). Does this approach
break the architecture of UBD?
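To make the IORING_POLL_ADD_MULTI analogy above concrete, here is a minimal
liburing sketch of that model (illustration only, not UBD or ubdsrv code;
the watch_fd() helper is made up, and io_uring_prep_poll_multishot() needs a
liburing/kernel recent enough to support multishot poll):

#include <liburing.h>
#include <poll.h>
#include <stdio.h>

/* Submit one multishot poll sqe, then only reap cqes. */
static int watch_fd(int fd)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int ret;

        ret = io_uring_queue_init(8, &ring, 0);
        if (ret < 0)
                return ret;

        /* One sqe is built and submitted once ... */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_poll_multishot(sqe, fd, POLLIN);
        io_uring_submit(&ring);

        /* ... and completions keep arriving without building new sqes. */
        for (;;) {
                ret = io_uring_wait_cqe(&ring, &cqe);
                if (ret < 0)
                        break;
                printf("fd %d ready, poll mask 0x%x\n", fd, cqe->res);
                /* IORING_CQE_F_MORE: the kernel will post further cqes. */
                if (!(cqe->flags & IORING_CQE_F_MORE)) {
                        io_uring_cqe_seen(&ring, cqe);
                        break;
                }
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return ret;
}

If the UBD fetch command behaved similarly, queue_rq() could post a cqe for
each new request directly, and ubdsrv would not have to rebuild and resubmit
an sqe per IO.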
2) UBDSRV (the userspace part) should not allocate data buffers itself. When
an application configures many queues with a large iodepth, UBDSRV has to
preallocate many buffers (256 KiB each), which results in heavy memory
overhead. I think data buffers should be allocated by the applications
themselves and passed to UBDSRV; in this way UBD offers more flexibility.
However, while handling a write request, the control flow has to return to
the kernel part again to set the buffer address and copy data from the bio
vectors. Would an ioctl that sets the buffer address and copies the write
data into the application buffer be helpful here?

3) ubd_fetch_and_submit() repeatedly iterates over the array of UBD IOs and
wastes CPU when no IO is ready to be submitted. I think it can be optimized
by adding a separate array that stores the UBD IOs which are ready to be
committed back to the kernel part. Then we could batch these IOs and avoid
unnecessary iterations over IOs which are not ready (still being fetched or
handled by targets).

4) Zero-copy support is important and we are trying to implement it now.

5) Currently, UBD only supports the loop target with io_uring, and all the
work (1. get one cqe, 2. issue the target io_uring IO, 3. get the target
io_uring IO completion, 4. prepare one sqe) is done in one thread. As far as
I know, some applications such as SPDK, network filesystems and customized
distributed systems do not support io_uring well. I think we should separate
target IO handling from the UBDSRV loop and allow applications to handle
target IOs themselves. Is this suggestion reasonable? (Or should UBD focus
only on io_uring-capable targets?)

Regards,
Zhang

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/ming1/linux/commits/v5.17-ubd-dev