On Thu, Jun 30, 2022 at 05:29:07PM +0800, Ziyang Zhang wrote: > On 2022/6/30 17:09, Ming Lei wrote: > > On Thu, Jun 30, 2022 at 03:16:21PM +0800, Ziyang Zhang wrote: > >> Hi, Ming > >> > >> On 2022/6/29 19:33, Ming Lei wrote: > >>> On Wed, Jun 29, 2022 at 11:22:23AM +0800, Ziyang Zhang wrote: > >>>> Hi Ming, > >>>> > >>>> On 2022/6/27 23:29, Ming Lei wrote: > >>>>> Hi Ziyang, > >>>>> > >>>>> On Mon, Jun 27, 2022 at 04:20:55PM +0800, Ziyang Zhang wrote: > >>>>>> Hi Ming, > >>>>>> > >>>>>> We are learning your ubd code and developing a library: libubd for ubd. > >>>>>> This article explains why we need libubd and how we design it. > >>>>>> > >>>>>> Related threads: > >>>>>> (1) https://lore.kernel.org/all/Yk%2Fn7UtGK1vVGFX0@T590/ > >>>>>> (2) https://lore.kernel.org/all/YnDhorlKgOKiWkiz@T590/ > >>>>>> (3) https://lore.kernel.org/all/20220509092312.254354-1-ming.lei@xxxxxxxxxx/ > >>>>>> (4) https://lore.kernel.org/all/20220517055358.3164431-1-ming.lei@xxxxxxxxxx/ > >>>>>> > >>>>>> > >>>>>> Userspace block driver(ubd)[1], based on io_uring passthrough, > >>>>>> allows users to define their own backend storage in userspace > >>>>>> and provides block devices such as /dev/ubdbX. > >>>>>> Ming Lei has provided kernel driver code: ubd_drv.c[2] > >>>>>> and userspace code: ubdsrv[3]. > >>>>>> > >>>>>> ubd_drv.c simply passes all blk-mq IO requests > >>>>>> to ubdsrv through io_uring sqes/cqes. We think the kernel code > >>>>>> is pretty well-designed. > >>>>>> > >>>>>> ubdsrv is implemented by a single daemon > >>>>>> and target(backend) IO handling(null_tgt and loop_tgt) > >>>>>> is embedded in the daemon. > >>>>>> While trying ubdsrv, we find ubdsrv is hard to be used > >>>>>> by our backend. > >>>>> > >>>>> ubd is supposed to provide one generic framework for user space block > >>>>> driver, and it can be used for doing lots of fun/useful thing. > >>>>> > >>>>> If I understand correctly, this isn't same with your use case: > >>>>> > >>>>> 1) your user space block driver isn't generic, and should be dedicated > >>>>> for Alibaba's uses > >>>>> > >>>>> 2) your case has been there for long time, and you want to switch from other > >>>>> approach(maybe tcmu) to ubd given ubd has better performance. > >>>>> > >>>> > >>>> Yes, you are correct :) > >>>> The idea of design libubd is actually from libtcmu. > >>>> > >>>> We do have some userspace storage system as the IO handling backend, > >>>> and we need ubd to provide block drivers such as /dev/ubdbX for up layer client apps. > >>>> > >>>> > >>>> I think your motivation is that provides a complete user block driver to users > >>>> and they DO NOT change any code. > >>>> Users DO change their code using libubd for embedding libubd into the backend. > >>>> > >>>> > >>>>>> First is description of our backend: > >>>>>> > >>>>>> (1) a distributing system sends/receives IO requests > >>>>>> through network. > >>>>>> > >>>>>> (2) The system use RPC calls among hundreds of > >>>>>> storage servers and RPC calls are associated with data buffers > >>>>>> allocated from a memory pool. > >>>>>> > >>>>>> (3) On each server for each device(/dev/vdX), our backend runs > >>>>>> many threads to handle IO requests and manage the device. > >>>>>> > >>>>>> Second are reasons why ubdsrv is hard to use for us: > >>>>>> > >>>>>> (1) ubdsrv requires the target(backend) issues IO requests > >>>>>> to the io_uring provided by ubdsrv but our backend > >>>>>> uses something like RPC and does not support io_uring. > >>>>> > >>>>> As one generic framework, the io command has to be io_uring > >>>>> passthrough, and the io doesn't have to be handled by io_uring. > >>>> > >>>> Yes, our backend define its own communicating method. > >>>> > >>>>> > >>>>> But IMO io_uring is much more efficient, so I'd try to make async io > >>>>> (io uring) as the 1st citizen in the framework, especially for new > >>>>> driver. > >>>>> > >>>>> But it can support other way really, such as use io_uring with eventfd, > >>>>> the other userspace context can handle io, then wake up io_uring context > >>>>> via eventfd. You may not use io_uring for handling io, but you still > >>>>> need to communicate with the context for handling io_uring passthrough > >>>>> command, and one mechanism(such as eventfd) has to be there for the > >>>>> communication. > >>>> > >>>> Ok, eventfd may be helpful. > >>>> If you read my API, you may find ubdlib_complete_io_request(). > >>>> I think the backend io worker thread can call this function to tell the > >>>> ubd queue thread(the io_uring context in it) to commit the IO. > >>> > >>> The ubdlib_complete_io_request() has to be called in the same pthread > >>> context, that looks not flexible. When you handle IO via non-io_uring in the same > >>> context, the cpu utilization in submission/completion side should be > >>> higher than io_uring. And this way should be worse than the usage in > >>> ubd/loop, that is why I suggest to use one io_uring for handling both > >>> io command and io request if possible. > >> > >> ubdlib_complete_io_request() can be called in the io worker thread, > >> not in the ubdsrv queue thread(with the io_uring context for handling uring_cmd). > >> > >> You can find ubd_runner.c in my libubd repo. There are many io worker > >> threads for each ubdsrv queue to handle IO requests. > >> > >> Actually this idea comes from tcmu-runner. The data flow is: > >> > >> 1) in ubdsrv queue thread, io_uring_enter(): returns(IO reqs received from blk-mq) > >> > >> 2) in ubdsrv queue thread, ubdsrv_reap_requests(): iterate on each cqe(with an IO req), > >> > >> for READ/WRITE requests, ubd_aio_queue_io() to enqueue the IO req into a io_queue > >> (each ubdsrv queue has one io_queue). This IO req's status is IO_HANDLING_ASYNC. > >> > >> for other simple(can be handled very quickly), > >> handle it right now and call ubdlib_complete_io_request() > >> > >> 3) in ubdsrv queue thread, ubdsrv_commit_and_fetch(): iterate on all IO slots per ubdsrv queue > >> and setup sqe if one IO(IO completion) is ready to commit. > >> > >> Here, some IO slots are still IO_HANDLING_ASYNC so no sqe is generated for them. > >> > >> > >> 4) in ubdsrv queue thread, io_uring_enter(): submit all sqes and wait for cqes > >> (io_uring_enter() will return after at least one IO req is received from blk-mq) > >> > >> 5) When 3) or 4) happens, at the same time in ubdsrv queue IO worker threads: > >> each io worker thread try to deque and handle one IO req from io_queue per ubdsrv queue. > >> > >> After the IO worker handles the IO req(WRITE/READ), it calls ubdlib_complete_io_request() > >> This function can mark this IO req's status to ready to commit. > >> > >> IO handling/completion and io_uring_enter() can happen at the same time. > >> > >> Besides, io_uring_enter can: > >> > >> 1) block and wait for cqes until at least > >> one blk-mq req comes from queue_rq() > >> > >> 2) submit sqes(with last IO completion and next fetch) > >> > >> so I have to consider how to notify io_uring about io completion > >> after io_uring_enter() is slept(block and wait for cqes). > > > > Yeah, that was exactly my question, :-) > > > >> > >> In current version of ubd_runner(an async libubd target), I try to use an "unblock" > >> io_uring_enter_timeout() and caller can set a timeout value for it. > >> So IO completions happen after io_uring_enter_timeout() call can be committed > >> by next io_uring_enter_timeout() call... > >> > >> But this is a very ugly implementation > >> because I may waste CPU on useless loops in ubdsrv queue thread if > >> blk-mq reqs do not income frequently. > >> > >> You mentioned that eventfd may be helpful and I agree with you. :) > >> I can register an eventfd in io_uring after ubd_aio_queue_io() and write the eventfd > >> in ubdlib_complete_io_request(). > >> > >> I will fix my code. > > > > FYI, there is one example about using eventfd to wakeup io_uring, which > > can be added to the library for your usecase: > > > > https://gist.github.com/1Jo1/6496d1b8b6b363c301271340e2eab95b > > Thanks, will take a view. > > > > >> > >>> > >>>> > >>>> > >>>> > >>>>> > >>>>>> > >>>>>> (2) ubdsrv forks a daemon and it takes over everything. > >>>>>> Users should type "list/stop/del" ctrl-commands to interact with > >>>>>> the daemon. It is inconvenient for our backend > >>>>>> because it has threads(from a C++ thread library) running inside. > >>>>> > >>>>> No, list/stop/del won't interact with the daemon, and the per-queue > >>>>> pthread is only handling IO commands(io_uring passthrough) and IO request. > >>>>> > >>>> > >>>> > >>>> Sorry I made a mistake. > >>>> > >>>> I mean from user's view, > >>>> he has to type list/del/stop from cmdlind to control the daemon. > >>>> (I know the control flow is cmdline-->ubd_drv.c-->ubdsrv daemon). > >>>> > >>>> This is a little weird if we try to make a ubd library. > >>>> So I actually provides APIs in libubd for users to do these list/del/stop works. > >>> > >>> OK, that is fine to export APIs for admin purpose. > >>> > >>>> > >>>> > >>>>>> > >>>>>> (3) ubdsrv PRE-allocates internal data buffers for each ubd device. > >>>>>> The data flow is: > >>>>>> bio vectors <-1-> ubdsrv data buffer <-2-> backend buffer(our RPC buffer). > >>>>>> Since ubdsrv does not export its internal data buffer to backend, > >>>>>> the second copy is unavoidable. > >>>>>> PRE-allocating data buffer may not be a good idea for wasting memory > >>>>>> if there are hundreds of ubd devices(/dev/ubdbX). > >>>>> > >>>>> The preallocation is just virtual memory, which is cheap and not pinned, but > >>>>> ubdsrv does support buffer provided by io command, see: > >>>>> > >>>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d > >>>> > >>>> Actually I discussed on the design of pre-allocation in your RFC patch for ubd_drv > >>>> but you did not reply :) > >>>> > >>>> I paste it here: > >>>> > >>>> "I am worried about the fixed-size(size is max io size, 256KiB) pre-allocated data buffers in UBDSRV > >>>> may consume too much memory. Do you mean these pages can be reclaimed by sth like madvise()? > >>>> If (1)swap is not set and (2)madvise() is not called, these pages may not be reclaimed." > >>>> > >>>> I observed that your ubdsrv use posix_memalign() to pre-allocate data buffers, > >>>> and I have already noticed the memory cost while testing your ubdsrv with hundreds of /dev/ubdbX. > >>> > >>> Usually posix_memalign just allocates virtual memory which is unlimited > >>> in 64bit arch, and pages should be allocated until the buffer is read or write. > >>> After the READ/WRITE is done, kernel still can reclaim the pages in this > >>> virtual memory. > >>> > >>> In future, we still may optimize the memory uses via madvise, such as > >>> MADV_DONTNEED, after the slot is idle for long enough. > >> > >> Ok, thanks for explanation. > >> > >>> > >>>> > >>>> Another IMPORTANT problem is your commit: > >>>> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d > >>>> may be not helpful for WRITE requests if I understand correctly. > >>>> > >>>> Consider this data flow: > >>>> > >>>> 1. ubdsrv commits an IO req(req1, a READ req). > >>>> > >>>> 2. ubdsrv issues a sqe(UBD_IO_COMMIT_AND_FETCH_REQ), and sets io->addr to addr1. > >>>> addr1 is the addr of buffer user passed. > >>>> > >>>> > >>>> 3. ubd gets the sqe and commits req1, sets io->addr to addr1. > >>>> > >>>> 4. ubd gets IO req(req2, a WRITE req) from blk-mq(queue_rq) and commit a cqe. > >>>> > >>>> 5. ubd copys data to be written from biovec to addr1 in a task_work. > >>>> > >>>> 6. ubdsrv gets the cqe and tell the IO target to handle req2. > >>>> > >>>> 7. IO target handles req2. It is a WRITE req so target issues a io_uring write > >>>> cmd(with buffer set to addr1). > >>>> > >>>> > >>>> > >>>> The problem happens in 5). You cannot know the actual data_len of an blk-mq req > >>>> until you get one in queue_rq. So length of addr1 may be less than data_len. > >>> > >>> So far, the actual length of buffer has to be set as at least rq_max_blocks, since > >>> we set it as ubd queue's max hw sectors. Yeah, you may argue memory > >>> waste, but process virtual address is unlimited for 64bit arch, and > >>> pages are allocated until actual read/write is started. > >> > >> Ok, since I allow users to config rq_max_blocks in libubd, > >> it's users' responsibility to ensure length of user buffers > >> is at least rq_max_blocks. > >> > >> Now I agree on your commit: > >> https://github.com/ming1/linux/commit/0a964a1700e11ba50227b6d633edf233bdd8a07d > >> > >> Provide WRITE buffer in advance(when sending COMMIT_AND_FETCH) seems OK :) > >> > >>> > >>>>> > >>>>>> > >>>>>> To better use ubd in more complicated scenarios, we have developed libubd. > >>>>>> It does not assume implementation of backend and can be embedded into it. > >>>>>> We refer to the code structure of tcmu-runner[4], > >>>>>> which includes a library(libtcmu) for users > >>>>>> to embed tcmu-runner inside backend's code. > >>>>>> It: > >>>>>> > >>>>>> (1) Does not fork/pthread_create but embedded in backend's threads > >>>>> > >>>>> That is because your backend may not use io_uring, I guess. > >>>>> > >>>>> But it is pretty easy to move the decision of creating pthread to target > >>>>> code, which can be done in the interface of .prepare_target(). > >>>> > >>>> I think the library should not create any thread if we want a libubd. > >>> > >>> I Agree. > >>> > >>>> > >>>>> > >>>>>> > >>>>>> (2) Provides libubd APIs for backend to add/delete ubd devices > >>>>>> and fetch/commit IO requests > >>>>> > >>>>> The above could be the main job of libubd. > >>>> > >>>> indeed. > >>>> > >>>>> > >>>>>> > >>>>>> (3) simply passes backend-provided data buffers to ubd_drv.c in kernel, > >>>>>> since the backend actually has no knowledge > >>>>>> on incoming data size until it gets an IO descriptor. > >>>>> > >>>>> I can understand your requirement, not look at your code yet, but libubd > >>>>> should be pretty thin from function viewpoint, and there are lots of common > >>>>> things to abstract/share among all drivers, please see recent ubdsrv change: > >>>>> > >>>>> https://github.com/ming1/ubdsrv/commits/master > >>>>> > >>>>> in which: > >>>>> - coroutine is added for handling target io > >>>>> - the target interface(ubdsrv_tgt_type) has been cleaned/improved for > >>>>> supporting complicated target > >>>>> - c++ support > >>>> > >>>> Yes, I have read your coroutine code but I am not an expert of C++ 20.:( > >>>> I think it is actually target(backend) design and ubd should not assume > >>>> how the backend handle IOs. > >>>> > >>>> The work ubd in userspace has to be done is: > >>>> > >>>> 1) give some IO descriptors to backend, such as ubd_get_io_requests() > >>>> > >>>> 2) get IO completion form backend, such as ubd_complete_io_requests() > >>> > >>> Or the user provides/registers two callbacks: handle_io_async() and > >>> io_complete(), the former is called when one request comes from ubd > >>> driver, the latter(optional) is called when one io is done. > >>> > >>> Also you didn't mention how you notify io_uring about io completion after > >>> io_uring_enter() is slept if your backend code doesn't use io_uring to > >>> handle io. > >>> > >>> I think one communication mechanism(such as eventfd) is needed for your > >>> case. > >> > >> Ok, I will try eventfd with io_uring. > >> > >>> > >>>> > >>>> > >>>> > >>>>> > >>>>> IMO, libubd isn't worth of one freshly new project, and it could be integrated > >>>>> into ubdsrv easily. The potential users could be existed usersapce > >>>>> block driver projects. > >>>> > >>>> Yes, so many userspace storage systems can use ubd! > >>>> You may look at tcmu-runner. It: > >>>> > >>>> 1) provides a library(libtcmu.c) for those who have a existing backend. > >>>> > >>>> 2) provides a runner(main.c in tcmu-runner) like your ubdsrv > >>>> for those who just want to run it. > >>>> And the runner is build on top of libtcmu. > >>>> > >>>>> > >>>>> If you don't object, I am happy to co-work with you to add the support > >>>>> for libubd in ubdsrv, then we can avoid to invent a wheel > >>>> > >>>> +1 :) > >>> > >>> Thinking of further, I'd suggest to split ubdsrv into two parts: > >>> > >>> 1) libubdsrv > >>> - provide APIs like what you did in libubd > >>> - provide API for notify io_uring(handling io command) that one io is > >>> completed, and the API should support handling IO from other context > >>> (not same with the io_uring context for handling io command). > >>> > >>> 2) ubd target > >>> - built on libubdsrv, such as ubd command is built on libubdsrv, and > >>> specific target implementation is built on the library too. > >>> > >>> It shouldn't be hard to work towards this direction, and I guess this > >>> way should make current target implementation more clean. > >>> > >> > >> Yes, this is like tcmu-runner's structure: a libtcmu and some target > >> Thanks, Ming. Glad to co-work with you. > >> > >> I will take your advice and improve libubd(the communication mechanism, maybe eventfd). > > > > I have added libublk branch for working towards this direction, if we > > cowork on libublk, please write patch against this branch, then I can > > apply your patch directly. > > > > https://github.com/ming1/ubdsrv/tree/libublk > > Ok, but It concerns me that libubdsrv may change current ubdsrv project's structure a lot > because: > 1) target implementation will be built on top of libubdsrv and the target > should create pthread(ubdsrv loop) itself. > > 2) have to remove pthread/process(daemon) in current ubdsrv to build libubdsrv. > It was really a hard job. :-( Both the two are not hard to do, and turns out that making libubdsrv is actually one big cleanup. All these works[1] are basically done: 1) libublksrv - built .so and .a are under lib/ - exported header file is include/ublksrv.h - so any other application can make ublk device against this library - eventfd notification is added too, so io handling doesn't have to be done via io_uring, one callback of ->handle_event(), and two APIs are added for this support 2) ublk/ubd utility - built against libublksrv, meantime it uses the private header of the library too, which is fine, since the two are in same project 3) two examples - demo_null.c: one < 200 LOC standalone example to show how to make a ubd/null block device against libublksrv - demo_event.c: one simple standalone example(~300LOC) to make one ublk disk by handling io via another pthread(not by io_uring) against libublksrv Any comments/feedback/tests are welcome. [1] libublk https://github.com/ming1/ubdsrv/tree/libublk Thanks, Ming