Normally, userspace block device implementations need to copy data between the kernel block layer's io requests and the userspace daemon; for example, ublk and tcmu both have similar logic. This copy consumes noticeable cpu resources, especially for large io.

There are methods that try to reduce this cpu overhead so that userspace block device io performance can be improved further:
 1) Use special hardware to do the memory copy, but it seems not all architectures have such hardware.
 2) Software methods, such as mmap()ing the kernel block layer's io request data into the userspace daemon [1], but this has page table map/unmap and tlb flush overhead, security issues, etc., and is probably only friendly to large io.

To solve this problem, I'd propose a new method which combines the respective advantages of io_uring and ebpf. Add a new program type BPF_PROG_TYPE_UBLK for ublk (a generic framework for implementing block device logic from userspace). The userspace block device daemon registers an ebpf prog, and this prog uses bpf helpers offered by the ublk bpf prog type to submit io requests on behalf of the daemon. Currently there is only one helper:

    u64 bpf_ublk_queue_sqe(struct ublk_io_bpf_ctx *bpf_ctx,
                           struct io_uring_sqe *sqe, u32 sqe_len, u32 fd)

This helper uses io_uring to submit io requests, so io_uring needs to be able to submit an sqe located in the kernel (some of this code idea comes from Pavel's patchset [2], but Pavel's patch still requires sqe->buf to come from a userspace address). The bpf prog initializes sqes, but it does not need to initialize the sqes' buf field; sqe->buf will come from kernel block layer io requests in some form. See patch 2 for more.

For the ublk loop target, we can easily implement the following logic in an ebpf prog (a rough sketch follows the list of advantages below):
 1. The userspace daemon registers an ebpf prog and passes two backend file fds in an ebpf map.
 2. For kernel io requests against the first half of the userspace device, the ebpf prog prepares an io_uring sqe which submits io against the first backend file fd, and the sqe's buffer comes from the kernel io request. Kernel io requests against the second half of the userspace device follow the same logic, only the sqe's fd is the second backend file fd.
 3. When the ublk driver's blk-mq queue_rq() is called, this ebpf prog is executed and completes the kernel io request.

That means that, by using ebpf, we can implement various userspace logic in the kernel.

From the above example, we can see that this method has at least three advantages:
 1. Memory copies between the kernel block layer and the userspace daemon are removed completely.
 2. Memory is saved: the userspace daemon doesn't need to maintain its own buffers to issue and complete io requests, it uses the kernel block layer's io request memory directly.
 3. The number of round trips between the kernel and the userspace daemon may be reduced, which may reduce kernel/userspace context switch overhead.
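For illustration, a minimal sketch of such a loop-target ebpf prog might look as follows. This is not code from the series: the program section name, the helper id, the struct ublk_io_bpf_ctx field names (start_sector, nr_sectors) and the device split point are placeholders chosen here for readability; the authoritative definitions are in patch 1 and patch 3.

/*
 * Rough sketch of the loop-target ebpf prog described above.
 * Placeholder names are marked in the comments below.
 */
#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <linux/ublk_cmd.h>	/* assumed to define struct ublk_io_bpf_ctx */
#include <bpf/bpf_helpers.h>

/* Prototype as documented above; a real build would pick this declaration
 * up from the bpf_helper_defs.h generated by the scripts/bpf_doc.py change. */
static __u64 (*bpf_ublk_queue_sqe)(struct ublk_io_bpf_ctx *bpf_ctx,
				   struct io_uring_sqe *sqe,
				   __u32 sqe_len, __u32 fd) = (void *) 212; /* placeholder id */

/* The userspace daemon stores the two backend file fds at index 0 and 1. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} backend_fds SEC(".maps");

/* First half of the ublk device, in 512-byte sectors (placeholder size). */
#define FIRST_HALF_SECTORS	(1ULL << 21)

SEC("ublk")	/* placeholder section name for BPF_PROG_TYPE_UBLK progs */
int ublk_loop(struct ublk_io_bpf_ctx *ctx)
{
	struct io_uring_sqe sqe = {};
	__u32 idx, *fd;

	/* First half of the device goes to backend fd 0, second half to fd 1. */
	idx = ctx->start_sector < FIRST_HALF_SECTORS ? 0 : 1;
	fd = bpf_map_lookup_elem(&backend_fds, &idx);
	if (!fd)
		return 0;

	/*
	 * Initialize everything except the data buffer: sqe->addr is wired up
	 * to the kernel block layer request's pages by ublk/io_uring (patch 2).
	 */
	sqe.opcode = IORING_OP_WRITE;	/* this RFC currently handles writes only */
	sqe.off = (ctx->start_sector - (idx ? FIRST_HALF_SECTORS : 0)) << 9;
	sqe.len = ctx->nr_sectors << 9;

	bpf_ublk_queue_sqe(ctx, &sqe, sizeof(sqe), *fd);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

The daemon side would then just load this prog, populate backend_fds with the two opened backing file fds, and register the prog with the ublk driver through the interface added in patch 3.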
Test:
Add a ublk loop target:
    ublk add -t loop -q 1 -d 128 -f loop.file

fio job file:
    [global]
    direct=1
    filename=/dev/ublkb0
    time_based
    runtime=60
    numjobs=1
    cpus_allowed=1

    [rand-read-4k]
    bs=512K
    iodepth=16
    ioengine=libaio
    rw=randwrite
    stonewall

Without this patch:
  WRITE: bw=745MiB/s (781MB/s), 745MiB/s-745MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60010-60010msec
  ublk daemon's cpu utilization is about 9.3%~10.0%, as shown by the top tool.

With this patch:
  WRITE: bw=744MiB/s (781MB/s), 744MiB/s-744MiB/s (781MB/s-781MB/s), io=43.6GiB (46.8GB), run=60012-60012msec
  ublk daemon's cpu utilization is about 1.3%~1.7%, as shown by the top tool.

From the above tests, this method clearly reduces the cpu copy overhead.

TODO:
I must say this patchset is just an RFC for the design.
1) Currently, this patchset only makes the ublk ebpf prog submit io requests using io_uring in the kernel; cqe events still need to be handled in the userspace daemon. Once we succeed in making io_uring handle cqes in the kernel, the ublk ebpf prog will be able to implement io entirely in the kernel.
2) The ublk driver needs to work better with ebpf; currently I did some hack code to support ebpf in the ublk driver, and it only supports write requests.
3) I have not done many tests yet, will run liburing/ublk/blktests later.

Any review and suggestions are welcome, thanks.

[1] https://lore.kernel.org/all/20220318095531.15479-1-xiaoguang.wang@xxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/cover.1621424513.git.asml.silence@xxxxxxxxx/

Xiaoguang Wang (3):
  bpf: add UBLK program type
  io_uring: enable io_uring to submit sqes located in kernel
  ublk_drv: add ebpf support

 drivers/block/ublk_drv.c       | 228 ++++++++++++++++++++++++++++++++-
 include/linux/bpf_types.h      |   2 +
 include/linux/io_uring.h       |  13 ++
 include/linux/io_uring_types.h |   8 +-
 include/uapi/linux/bpf.h       |   2 +
 include/uapi/linux/ublk_cmd.h  |  11 ++
 io_uring/io_uring.c            |  59 ++++++++-
 io_uring/rsrc.c                |  15 +++
 io_uring/rsrc.h                |   3 +
 io_uring/rw.c                  |   7 +
 kernel/bpf/syscall.c           |   1 +
 kernel/bpf/verifier.c          |   9 +-
 scripts/bpf_doc.py             |   4 +
 tools/include/uapi/linux/bpf.h |   9 ++
 tools/lib/bpf/libbpf.c         |   2 +
 15 files changed, 366 insertions(+), 7 deletions(-)

-- 
2.31.1