hi,

> On Sat, May 07, 2022 at 12:20:17PM +0800, ZiyangZhang wrote:
>> On 2022/5/3 16:02, Ming Lei wrote:
>>> Hello Gabriel,
>>>
>>> CC linux-block and hope you don't mind, :-)
>>>
>>> On Mon, May 02, 2022 at 01:41:13PM -0400, Gabriel Krisman Bertazi wrote:
>>>> Hi Ming,
>>>>
>>>> First of all, I hope I didn't put you on the spot too much during the
>>>> discussion. My original plan was to propose my design, but your
>>>> implementation pretty much solved the questions I had. :)
>>>
>>> I think that is open source, then we can put efforts together to make
>>> things better.
>>>
>>>> I'd like to follow up to restart the communication and see where I
>>>> can help more with UBD. As I said during the talk, I've done some fio
>>>> runs and I was able to saturate NBD much faster than UBD:
>>>>
>>>> https://people.collabora.com/~krisman/mingl-ubd/bw.png
>>>
>>> Yeah, that is true since NBD has extra socket communication cost which
>>> can't be as efficient as io_uring.
>>>
>>>> I've also written some fixes to the initialization path, which I
>>>> planned to send to you as soon as you published your code, but I
>>>> think you might want to take a look already and see if you want to
>>>> just squash it into your code base.
>>>>
>>>> I pushed those fixes here:
>>>>
>>>> https://gitlab.collabora.com/krisman/linux/-/tree/mingl-ubd
>>>
>>> I have added the 1st fix and the 3rd patch into my tree:
>>>
>>> https://github.com/ming1/linux/commits/v5.17-ubd-dev
>>>
>>> The added check in the 2nd patch is done lockless, which may not be
>>> reliable enough, so I didn't add it. Also, adding a device is in the
>>> slow path, and there is no need to optimize that code path.
>>>
>>> I also cleaned up the ubd driver a bit: debug code cleanup, removed
>>> the zero-copy code, removed the UBD_IO_GET_DATA command, and always
>>> build the ubd driver as builtin.
>>>
>>> The ubdsrv part has been cleaned up too:
>>>
>>> https://github.com/ming1/ubdsrv
>>>
>>>> I'm looking into adding support for multiple driver queues next, and
>>>> should be able to share some patches on that shortly.
>>>
>>> OK, please post them on linux-block so that more eyes can look at the
>>> code; meantime the ubdsrv side needs to handle MQ too.
>>>
>>> Sooner or later, the single ubdsrv task may be saturated just by
>>> copying data and doing io_uring command communication, which can be
>>> shown by running IO against the ubd-null target. On my laptop, the
>>> ubdsrv CPU utilization is close to 90% when IOPS is > 500K. So MQ may
>>> help some fast backing cases.
>>>
>>>
>>> Thanks,
>>> Ming
>>
>> Hi Ming,
>>
>> Now I am learning your userspace block driver (UBD) [1][2] and we plan
>> to replace TCMU with UBD as a new choice for implementing userspace
>> bdevs, because of its high performance and simplicity.
>>
>> First, we have conducted some tests with fio and perf to evaluate UBD.
>>
>> 1) UBD achieves higher throughput than TCMU. We think TCMU suffers from
>> the complicated SCSI layer and does not support multiqueue, whereas UBD
>> simply uses io_uring passthrough and may support multiqueue in the
>> future. (Note that even with a single queue now, UBD outperforms TCMU.)
>
> MQ isn't hard to support, and it is basically workable now:
>
> https://github.com/ming1/ubdsrv/commits/devel
> https://github.com/ming1/linux/commits/my_for-5.18-ubd-devel
>
> Just the affinity of the pthread for each queue isn't set up yet.
>
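Setting the per-queue pthread affinity looks straightforward from the
ubdsrv side; a minimal sketch (ubd_queue_set_affinity and the
one-CPU-per-queue mapping below are only illustrative, not existing
ubdsrv code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the pthread serving one ubd queue to a single CPU. */
static int ubd_queue_set_affinity(pthread_t thread, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(thread, sizeof(set), &set);
}
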
>> 2) Some functions in UBD result in high CPU utilization and we guess
>> they also lower throughput. For example, ubdsrv_submit_fetch_commands()
>> frequently iterates over the array of UBD IOs and wastes CPU when no IO
>> is ready to be submitted. Besides, ubd_copy_pages() makes the CPU copy
>> data between bio vectors and UBD internal buffers while handling write
>> and read requests, and this could be eliminated by supporting
>> zero-copy.
>
> The copy itself doesn't take much CPU, see the following trace:
>
> -   34.36%     3.73%  ubd  [kernel.kallsyms]  [k] ubd_copy_pages.isra.0
>    - 30.63% ubd_copy_pages.isra.0
>       - 23.86% internal_get_user_pages_fast
>          + 21.14% get_user_pages_unlocked
>          + 2.62% lockless_pages_from_mm
>         6.42% ubd_release_pages.constprop.0
>
> And we may provide an option to pin pages for the disk's lifetime to
> avoid the cost in _get_user_pages_fast().
>
> Zero-copy has to touch the page table, and its cost may be expensive
> too. The big problem is that MM doesn't provide a mechanism to support
> generically remapping kernel pages to userspace.
>
>> Second, I'd like to share some ideas on UBD. I'm not sure if they are
>> reasonable, so please point out my mistakes.
>>
>> 1) UBD issues one sqe to commit the last completed request and fetch a
>> new one. Then blk-mq's queue_rq() issues a new UBD IO request and
>> completes one cqe for the fetch command. We have measured that
>> io_submit_sqes() costs some CPU, and the steps of building a new sqe
>> may lower throughput. Here I'd like to suggest a new solution: never
>> submit an sqe, but make up a cqe (with the information of the new UBD
>> IO request) when queue_rq() is called. I am inspired by one io_uring
>> flag: IORING_POLL_ADD_MULTI, with which a user issues only one sqe for
>> polling an fd and repeatedly gets multiple cqes when new events occur.
>> Does this solution break the architecture of UBD?
>
> But each cqe has to be associated with one sqe, if I understand
> correctly.

Yeah, for the current io_uring implementation, it is. But if io_uring
offered a helper like the one below:

void io_gen_cqe_direct(struct file *file, u64 user_data, s32 res, u32 cflags)
{
	struct io_ring_ctx *ctx;
	ctx = file->private_data;

	spin_lock(&ctx->completion_lock);
	__io_fill_cqe(ctx, user_data, res, cflags);
	io_commit_cqring(ctx);
	spin_unlock(&ctx->completion_lock);
	io_cqring_ev_posted(ctx);
}

then in the ubd driver:

1) device setup stage
We attach io_uring files and user_data to every ubd hard queue.

2) when blk-mq->queue_rq() is called
io_gen_cqe_direct() will be called in ubd's queue_rq(), and we put the
ubd io request's qid and tag info into the cqe's res field; then we
don't need to issue an sqe to fetch io cmds.
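
For example, the queue_rq side could look roughly like this (struct
ubd_queue, its fields and the qid/tag packing are only illustrative
here, not existing driver code):

#include <linux/blk-mq.h>

/*
 * Sketch only: complete the per-queue fetch command's cqe directly from
 * queue_rq(), using the io_gen_cqe_direct() helper proposed above.
 * ubq->ring_file and ubq->cmd_user_data are assumed to have been saved
 * at device setup time; data copy and request completion still happen
 * later, when ubdsrv commits the result.
 */
static blk_status_t ubd_queue_rq(struct blk_mq_hw_ctx *hctx,
				 const struct blk_mq_queue_data *bd)
{
	struct ubd_queue *ubq = hctx->driver_data;	/* illustrative */
	struct request *rq = bd->rq;
	s32 res = (ubq->q_id << 16) | rq->tag;		/* example packing */

	blk_mq_start_request(rq);
	io_gen_cqe_direct(ubq->ring_file, ubq->cmd_user_data, res, 0);
	return BLK_STS_OK;
}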

Regards,
Xiaoguang Wang

>
> I will research IORING_POLL_ADD_MULTI a bit and see if it can help UBD.
> And yes, batching is really important for UBD's performance.
>
>> 2) UBDSRV (the userspace part) should not allocate data buffers
>> itself. When an application configures many queues with a bigger
>> iodepth, UBDSRV has to preallocate more buffers (size = 256KiB), which
>> results in heavy memory overhead. I think data buffers should be
>> allocated by applications
>
> That is just virtual memory, and pages can be reclaimed after IO is
> done.
>
>> themselves and passed to UBDSRV. In this way UBD offers more
>> flexibility. However, while handling a write request, the control flow
>> returns to the kernel part again to set the buf addr and copy data from
>> bio vectors. Is an ioctl helpful for setting the buf addr and copying
>> write data to the app buf?
>
> It is pretty easy to pass an application buffer to UBD_IO_FETCH_REQ or
> UBD_IO_COMMIT_AND_FETCH_REQ, just by overriding ios[i].buf_addr, which
> is sent to the ubd driver via ubdsrv_io_cmd->addr.
>
> No ioctl is needed; the io_uring command can handle everything.
>
> I think the idea is good, and we can provide an option for using either
> the pre-allocated buffer or an application buffer.
>
> But the application buffer has to be in the same process VM space as
> the ubdsrv daemon, otherwise it becomes slower to pin these application
> buffers/pages.
>
>> 3) ubd_fetch_and_submit() frequently iterates over the array of ubd
>> IOs and wastes CPU when no IO is ready to be submitted. I think it can
>> be optimized by adding a new array storing the UBD IOs that are ready
>> to be committed back to the kernel part. Then we could batch these IOs
>> and avoid unnecessary iterations over IOs which are not ready (still
>> being fetched or handled by targets).
>
> It should be easy to avoid the whole-queue iteration, but my perf trace
> doesn't show ubd_fetch_and_submit() consuming much CPU.
>
>> 4) Zero-copy support is important and we are trying to implement it now.
>
> I talked with Xiaoguang wrt. zero-copy support, and it looks like it
> isn't ready as a generic approach. Once it is ready, it will be easy to
> integrate into UBD.
>
>> 5) Currently, UBD only supports the loop target with io_uring, and all
>> the work (1. get one cqe, 2. issue the target io_uring IO, 3. get the
>> target io_uring IO completion, 4. prepare one sqe) is done in one
>> thread. As far as I know,
>
> loop is one example, and it provides similar functionality to the
> kernel loop driver in < 200 lines of userspace code.
>
>> some applications such as SPDK, network filesystems and customized
>> distributed systems do not support io_uring well. I think we should
>> separate target IO handling from the UBDSRV loop and allow applications
>> to handle target IOs themselves. Is this suggestion reasonable? (Or
>> should UBD focus on io_uring-supported targets?)
>
> UBD provides a framework for implementing userspace block drivers; you
> can do everything for handling the IO in userspace. The target code
> just needs to implement the callbacks defined in ubdsrv_tgt_type, so it
> is already separated from the ubd loop. But UBD is still at an early
> stage, and the interface will continue to be improved or re-designed.
> Can you explain your ideas in a bit more detail? It would be very
> helpful if you could provide some application background.
>
> The reason I suggested using io_uring is that io_uring is very
> efficient; also, async IO has been proven to be a very efficient
> approach for handling IO.
>
>
> Thanks,
> Ming
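
Btw, regarding 3) above: a small per-queue ready list would be enough to
avoid scanning the whole ios[] array before committing. A minimal
userspace sketch (ubd_ready_list and ubd_queue_commit_and_fetch are only
illustrative names, not existing ubdsrv code):

/* Hypothetical helper: preps one COMMIT_AND_FETCH sqe for this tag. */
void ubd_queue_commit_and_fetch(unsigned short tag);

/*
 * Track which tags have completed target IO and are ready to be
 * committed back, so the submit path only touches ready entries.
 */
struct ubd_ready_list {
	unsigned short *tags;	/* array sized to the queue depth */
	unsigned int nr;
};

static void ubd_mark_ready(struct ubd_ready_list *rl, unsigned short tag)
{
	rl->tags[rl->nr++] = tag;
}

static void ubd_commit_ready(struct ubd_ready_list *rl)
{
	for (unsigned int i = 0; i < rl->nr; i++)
		ubd_queue_commit_and_fetch(rl->tags[i]);
	rl->nr = 0;
}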