Re: [RFC 0/3] Add io_uring & ebpf based methods to implement zero-copy for ublk

On 2023/2/15 08:41, Xiaoguang Wang wrote:
> Normally, userspace block device implementations need to copy data between
> the kernel block layer's io requests and the userspace daemon; for example,
> ublk and tcmu both have similar logic. This copy consumes noticeable cpu
> resources, especially for large io.
> 
> There are methods that try to reduce this cpu overhead so that the userspace
> block device's io performance can be improved further. These methods include:
> 1) use special hardware to do the memory copy, but it seems not all
> architectures have such hardware; 2) software methods, such as mmap()ing the
> kernel block layer's io request data into the userspace daemon [1], but this
> has page table map/unmap and tlb flush overhead, security issues, etc., and
> it may only be friendly to large io.
> 
> ublk is a generic framework for implementing block device logic from
> userspace. Typical userspace block device implementations need to copy data
> between the kernel block layer's io requests and the userspace daemon, which
> consumes cpu resources, especially for large io.
> 
> To solve this problem, I propose a new method that combines the respective
> advantages of io_uring and ebpf: add a new program type BPF_PROG_TYPE_UBLK
> for ublk, and have the userspace block device daemon process register an
> ebpf prog. This bpf prog uses helpers offered by the ublk bpf prog type to
> submit io requests on behalf of the daemon process. Currently there is only
> one helper:
>     u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
> 		struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)
> 
> This helper uses io_uring to submit io requests, so we need to make io_uring
> able to submit an sqe located in kernel memory (some of the code ideas come
> from Pavel's patchset [2], but Pavel's patches still require the sqe's buffer
> to come from a userspace address). The bpf prog initializes the sqes, but
> does not need to initialize their buf field; sqe->buf will come from the
> kernel block layer io requests in some form. See patch 2 for more.
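
A minimal userspace-side sketch of how a daemon might set this up is below.
The libbpf calls are standard; BPF_PROG_TYPE_UBLK and the final step of
handing the prog fd to the ublk device come from this patchset, and the
object, map and file names are only placeholders:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/types.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;
	int map_fd, prog_fd, backend_fd;
	__u32 key;

	/* Object file containing the ublk bpf progs; the name is illustrative. */
	obj = bpf_object__open_file("ublk_loop_bpf.o", NULL);
	if (!obj)
		return 1;

	prog = bpf_object__find_program_by_name(obj, "ublk_io_submit_prog");
	if (!prog)
		return 1;
	/* New program type added by patch 1 of this series. */
	bpf_program__set_type(prog, BPF_PROG_TYPE_UBLK);

	if (bpf_object__load(obj))
		return 1;
	prog_fd = bpf_program__fd(prog);

	/* Array map holding the two backend file fds (see the loop example below). */
	map_fd = bpf_object__find_map_fd_by_name(obj, "backend_fds");
	for (key = 0; key < 2; key++) {
		backend_fd = open(key ? "backend1.img" : "backend0.img",
				  O_RDWR | O_DIRECT);
		if (backend_fd < 0)
			return 1;
		bpf_map_update_elem(map_fd, &key, &backend_fd, 0);
	}

	/*
	 * prog_fd would then be registered with the ublk device through the
	 * new command added to include/uapi/linux/ublk_cmd.h by patch 3;
	 * that UAPI is not spelled out in the cover letter, so it is omitted
	 * here.
	 */
	(void)prog_fd;
	return 0;
}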
> 
> In the example of the ublk loop target, we can easily implement logic like
> the below in an ebpf prog:
>   1. The userspace daemon registers an ebpf prog and passes two backend file
> fds in an ebpf map structure.
>   2. For kernel io requests against the first half of the userspace device,
> the ebpf prog prepares an io_uring sqe, which submits io against the first
> backend file fd, and the sqe's buffer comes from the kernel io request.
> Kernel io requests against the second half of the userspace device have
> similar logic; only the sqe's fd will be the second backend file fd.
>   3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog is
> executed and completes the kernel io requests.
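
A rough sketch of what such a loop-target prog could look like is below.
struct ublk_io_bpf_ctx and the bpf_ublk_queue_sqe() declaration are expected
to come from the patched headers; the ctx field names used here are
assumptions based on the attributes mentioned later in this thread, and the
SEC() name and the half-device split constant are made up:

#include <linux/types.h>
#include <linux/io_uring.h>
#include <linux/ublk_cmd.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Illustrative split point: first/second half of the ublk device. */
#define DEV_HALF_SECTORS (1ULL << 21)

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);	/* two backend file fds, filled by the daemon */
} backend_fds SEC(".maps");

SEC("ublk")	/* section name for BPF_PROG_TYPE_UBLK is a guess */
int ublk_io_submit_prog(struct ublk_io_bpf_ctx *ctx)
{
	struct io_uring_sqe sqe = {};
	__u32 key, *fd;

	/* First half of the device -> backend fd 0, second half -> fd 1. */
	key = ctx->start_sector < DEV_HALF_SECTORS ? 0 : 1;
	fd = bpf_map_lookup_elem(&backend_fds, &key);
	if (!fd)
		return 0;

	/* Assumes ctx->op uses the UBLK_IO_OP_* encoding from ublk_cmd.h. */
	sqe.opcode = ctx->op == UBLK_IO_OP_WRITE ? IORING_OP_WRITE : IORING_OP_READ;
	sqe.fd     = *fd;
	sqe.off    = (__u64)ctx->start_sector << 9;
	sqe.len    = ctx->nr_sectors << 9;
	/*
	 * sqe.addr is deliberately left unset: with patch 2 the buffer is
	 * taken from the kernel blk-mq request, so no copy to userspace
	 * memory happens.
	 */

	bpf_ublk_queue_sqe(ctx, &sqe, sizeof(sqe), *fd);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";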
> 
> That means that, by using ebpf, we can implement various userspace logic in
> the kernel.
> 
> From the above example, we can see that this method has at least 3 advantages:
>   1. It completely removes the memory copy between the kernel block layer
> and the userspace daemon.
>   2. It saves memory. The userspace daemon doesn't need to maintain its own
> buffers to issue and complete io requests; it uses the kernel block layer's
> io request memory directly.
>   3. It may reduce the number of round trips between the kernel and the
> userspace daemon, and thus the kernel/userspace context switch overhead.
> 
> Test:
> Add a ublk loop target: ublk add -t loop -q 1 -d 128 -f loop.file
> 
> fio job file:
>   [global]
>   direct=1
>   filename=/dev/ublkb0
>   time_based
>   runtime=60
>   numjobs=1
>   cpus_allowed=1
>   
>   [rand-read-4k]
>   bs=512K
>   iodepth=16
>   ioengine=libaio
>   rw=randwrite
>   stonewall
> 
> 
> Without this patch:
>   WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
>   ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by top.
> 
> With this patch:
>   WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
>   ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by top.
> 
> From the above tests, this method reduces the cpu copy overhead significantly.
> 
> 
> TODO:
> I must say this patchset is just an RFC for the design.
> 
> 1) Currently, this patchset only makes the ublk ebpf prog submit io requests
> using io_uring in the kernel; cqe events still need to be handled in the
> userspace daemon. Once we later succeed in making io_uring handle cqes in
> the kernel, the ublk ebpf prog will be able to implement io entirely in the
> kernel.
> 
> 2) The ublk driver needs to work better with ebpf. Currently I added some
> hack code to support ebpf in the ublk driver, and it can only support write
> requests.
> 
> 3) I have not done many tests yet; I will run liburing/ublk/blktests later.
> 
> Any review and suggestions are welcome, thanks.
> 
> [1] https://lore.kernel.org/all/20220318095531.15479-1-xiaoguang.wang@xxxxxxxxxxxxxxxxx/
> [2] https://lore.kernel.org/all/cover.1621424513.git.asml.silence@xxxxxxxxx/
> 
> 
> Xiaoguang Wang (3):
>   bpf: add UBLK program type
>   io_uring: enable io_uring to submit sqes located in kernel
>   ublk_drv: add ebpf support
> 
>  drivers/block/ublk_drv.c       | 228 ++++++++++++++++++++++++++++++++-
>  include/linux/bpf_types.h      |   2 +
>  include/linux/io_uring.h       |  13 ++
>  include/linux/io_uring_types.h |   8 +-
>  include/uapi/linux/bpf.h       |   2 +
>  include/uapi/linux/ublk_cmd.h  |  11 ++
>  io_uring/io_uring.c            |  59 ++++++++-
>  io_uring/rsrc.c                |  15 +++
>  io_uring/rsrc.h                |   3 +
>  io_uring/rw.c                  |   7 +
>  kernel/bpf/syscall.c           |   1 +
>  kernel/bpf/verifier.c          |   9 +-
>  scripts/bpf_doc.py             |   4 +
>  tools/include/uapi/linux/bpf.h |   9 ++
>  tools/lib/bpf/libbpf.c         |   2 +
>  15 files changed, 366 insertions(+), 7 deletions(-)
> 

Hi, here is the perf report output of the ublk daemon (loop target):


+   57.96%     4.03%  ublk           liburing.so.2.2                                [.] _io_uring_get_cqe
+   53.94%     0.00%  ublk           [kernel.vmlinux]                               [k] entry_SYSCALL_64
+   53.94%     0.65%  ublk           [kernel.vmlinux]                               [k] do_syscall_64
+   48.37%     1.18%  ublk           [kernel.vmlinux]                               [k] __do_sys_io_uring_enter
+   42.92%     1.72%  ublk           [kernel.vmlinux]                               [k] io_cqring_wait
+   35.17%     0.06%  ublk           [kernel.vmlinux]                               [k] task_work_run
+   34.75%     0.53%  ublk           [kernel.vmlinux]                               [k] io_run_task_work_sig
+   33.45%     0.00%  ublk           [kernel.vmlinux]                               [k] ublk_bpf_io_submit_fn
+   33.16%     0.06%  ublk           bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog  [k] bpf_prog_3bdc6181a3c616fb_ublk_io_sub
+   32.68%     0.00%  iou-wrk-18583  [unknown]                                      [k] 0000000000000000
+   32.68%     0.00%  iou-wrk-18583  [unknown]                                      [k] 0x00007efe920b1040
+   32.68%     0.00%  iou-wrk-18583  [kernel.vmlinux]                               [k] ret_from_fork
+   32.68%     0.47%  iou-wrk-18583  [kernel.vmlinux]                               [k] io_wqe_worker
+   30.61%     0.00%  ublk           [kernel.vmlinux]                               [k] io_submit_sqe
+   30.31%     0.06%  ublk           [kernel.vmlinux]                               [k] io_issue_sqe
+   28.00%     0.00%  ublk           [kernel.vmlinux]                               [k] bpf_ublk_queue_sqe
+   28.00%     0.00%  ublk           [kernel.vmlinux]                               [k] io_uring_submit_sqe
+   27.18%     0.00%  ublk           [kernel.vmlinux]                               [k] io_write
+   27.18%     0.00%  ublk           [xfs]                                          [k] xfs_file_write_iter

The call stack is:

-   57.96%     4.03%  ublk           liburing.so.2.2                                [.] _io_uring_get_cqe
   - 53.94% _io_uring_get_cqe
        entry_SYSCALL_64
      - do_syscall_64
         - 48.37% __do_sys_io_uring_enter
            - 42.92% io_cqring_wait
               - 34.75% io_run_task_work_sig
                  - task_work_run
                     - 32.50% ublk_bpf_io_submit_fn
                        - 32.21% bpf_prog_3bdc6181a3c616fb_ublk_io_submit_prog
                           - 27.12% bpf_ublk_queue_sqe
                              - io_uring_submit_sqe
                                 - 26.64% io_submit_sqe
                                    - 26.35% io_issue_sqe
                                       - io_write
                                         xfs_file_write_iter

Here, "io_submit" ebpf prog will be run in task_work of ublk daemon
process after io_uring_enter() syscall. In this ebpf prog, a sqe is
built and submitted. All information about this blk-mq request is
stored in a "ctx". Then io_uring can write to the backing file
(xfs_file_write_iter).
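
For reference, the daemon-side loop that produces those _io_uring_get_cqe
samples is just normal liburing cqe reaping, roughly like the sketch below.
This is generic liburing usage, not the actual ublk daemon code; committing
results back to the ublk device is only noted in a comment:

#include <liburing.h>

static int reap_completions(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;
	int ret;

	/*
	 * Blocks in io_uring_enter() -> io_cqring_wait(); the task_work run
	 * on the way out of the syscall is where ublk_bpf_io_submit_fn and
	 * the io_submit bpf prog show up in the profile above.
	 */
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;

	io_uring_for_each_cqe(ring, head, cqe) {
		/*
		 * cqe->user_data identifies the ublk io; its result would be
		 * committed back to the ublk device here (e.g. via
		 * UBLK_IO_COMMIT_AND_FETCH_REQ).
		 */
		seen++;
	}
	io_uring_cq_advance(ring, seen);
	return 0;
}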

Here is the call stack from the perf report output of fio:

-    5.04%     0.18%  fio      [kernel.vmlinux]                             [k] ublk_queue_rq
   - 4.86% ublk_queue_rq
      - 3.67% bpf_prog_b8456549dbe40c37_ublk_io_prep_prog
         - 3.10% bpf_trace_printk
              2.83% _raw_spin_unlock_irqrestore
      - 0.70% task_work_add
         - try_to_wake_up
              _raw_spin_unlock_irqrestore

Here, "io_prep" ebpf prog will be run in "ublk_queue_rq" process.
In this ebpf prog, qid, tag, nr_sectors, start_sector, op, flags
will be stored in one "ctx". Then we add a task_work to the ublk
daemon process.
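
A rough sketch of what such an io_prep prog might look like is below. The
ublk_io_bpf_ctx field names are assumptions based on the attributes listed
above, and stashing them in a hash map keyed by (qid, tag) is only one
possible way for the prep and submit progs to share state; the RFC does not
spell this out:

#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct io_key {
	__u16 qid;
	__u16 tag;
};

struct io_desc {
	__u64 start_sector;
	__u32 nr_sectors;
	__u32 op;
	__u32 flags;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 128);	/* matches the -d 128 queue depth above */
	__type(key, struct io_key);
	__type(value, struct io_desc);
} inflight SEC(".maps");

SEC("ublk")	/* section name for BPF_PROG_TYPE_UBLK is a guess */
int ublk_io_prep_prog(struct ublk_io_bpf_ctx *ctx)
{
	struct io_key key = { .qid = ctx->qid, .tag = ctx->tag };
	struct io_desc desc = {
		.start_sector = ctx->start_sector,
		.nr_sectors   = ctx->nr_sectors,
		.op           = ctx->op,
		.flags        = ctx->flags,
	};

	/*
	 * Runs from ublk_queue_rq(); the driver then queues a task_work on
	 * the daemon, where the io_submit prog can look this entry up and
	 * build the sqe.
	 */
	bpf_map_update_elem(&inflight, &key, &desc, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";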

Regards,
Zhang


