On 3/31/23 21:55, Bernd Schubert wrote: > On 3/30/23 13:36, Ming Lei wrote: >> Apply io_uring fused command for supporting zero copy: >> >> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and >> deinit it >> in ublk_unmap_io(), and this buffer is immutable, so it is just fine to >> retrieve it from concurrent fused command. >> >> 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this >> fused cmd(zero copy) buffer >> >> 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary >> request and submit secondary request; meantime setup complete callback >> via >> this API, once secondary request is completed, the complete callback is >> called for freeing the buffer and completing the fused command >> >> Also request reference is held during fused command lifetime, and this >> way >> guarantees that request buffer won't be freed until all inflight fused >> commands are completed. >> >> userspace(only implement sqe128 fused command): >> >> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6 >> >> liburing test(only implement normal sqe fused command: two 64byte SQEs) >> >> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6 >> >> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> >> --- >> Documentation/block/ublk.rst | 126 ++++++++++++++++++++-- >> drivers/block/ublk_drv.c | 192 ++++++++++++++++++++++++++++++++-- >> include/uapi/linux/ublk_cmd.h | 6 +- >> 3 files changed, 303 insertions(+), 21 deletions(-) >> >> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst >> index 1713b2890abb..7b7aa24e9729 100644 >> --- a/Documentation/block/ublk.rst >> +++ b/Documentation/block/ublk.rst >> @@ -297,18 +297,126 @@ with specified IO tag in the command data: >> ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy >> the server buffer (pages) read to the IO request pages. >> -Future development >> -================== >> +- ``UBLK_IO_FUSED_SUBMIT_IO`` >> + >> + Used for implementing zero copy feature. >> + >> + It has to been the primary command of io_uring fused command. This >> command >> + submits the generic secondary IO request with io buffer provided by >> our primary >> + command, and won't be completed until the secondary request is done. >> + >> + The provided buffer is represented as ``io_uring_bvec_buf``, which is >> + actually ublk request buffer's reference, and the reference is >> shared & >> + read-only, so the generic secondary request can retrieve any part >> of the buffer >> + by passing buffer offset & length. >> Zero copy >> ---------- >> +========= >> + >> +What is zero copy? >> +------------------ >> + >> +When application submits IO to ``/dev/ublkb*``, userspace >> buffer(direct io) >> +or page cache buffer(buffered io) or kernel buffer(meta io often) is >> used >> +for submitting data to ublk driver, and all kinds of these buffers are >> +represented by bio/bvecs(ublk request buffer) finally. Before supporting >> +zero copy, data in these buffers has to be copied to ublk server >> userspace >> +buffer before handling WRITE IO, or after handing READ IO, so that ublk >> +server can handle IO for ``/dev/ublkb*`` with the copied data. >> + >> +The extra copy between ublk request buffer and ublk server userspace >> buffer >> +not only increases CPU utilization(such as pinning pages, copy data), >> but >> +also consumes memory bandwidth, and the cost could be very big when >> IO size >> +is big. It is observed that ublk-null IOPS may be increased to ~5X if >> the >> +extra copy can be avoided. >> + >> +So zero copy is very important for supporting high performance block >> device >> +in userspace. >> + >> +Technical requirements >> +---------------------- >> + >> +- ublk request buffer use >> + >> +ublk request buffer is represented by bio/bvec, which is immutable, >> so do >> +not try to change bvec via buffer reference; data can be read from or >> +written to the buffer according to buffer direction, but bvec can't be >> +changed >> + >> +- buffer lifetime >> + >> +Ublk server borrows ublk request buffer for handling ublk IO, ublk >> request >> +buffer reference is used. Reference can't outlive the referent >> buffer. That >> +means all request buffer references have to be released by ublk server >> +before ublk driver completes this request, when request buffer ownership >> +is transferred to upper layer(FS, application, ...). >> + >> +Also after ublk request is completed, any page belonging to this ublk >> +request can not be written or read any more from ublk server since it is >> +one block device from kernel viewpoint. >> + >> +- buffer direction >> + >> +For ublk WRITE request, ublk request buffer should only be accessed >> as data >> +source, and the buffer can't be written by ublk server >> + >> +For ublk READ request, ublk request buffer should only be accessed as >> data >> +destination, and the buffer can't be read by ublk server, otherwise >> kernel >> +data is leaked to ublk server, which can be unprivileged application. >> + >> +- arbitrary size sub-buffer needs to be retrieved from ublk server >> + >> +ublk is one generic framework for implementing block device in >> userspace, >> +and typical requirements include logical volume manager(mirror, >> stripped, ...), >> +distributed network storage, compressed target, ... >> + >> +ublk server needs to retrieve arbitrary size sub-buffer of ublk >> request, and >> +ublk server needs to submit IOs with these sub-buffer(s). That also >> means >> +arbitrary size sub-buffer(s) can be used to submit IO multiple times. >> + >> +Any sub-buffer is actually one reference of ublk request buffer, which >> +ownership can't be transferred to upper layer if any reference is held >> +by ublk server. >> + >> +Why slice isn't good for ublk zero copy >> +--------------------------------------- >> + >> +- spliced page from ->splice_read() can't be written >> + >> +ublk READ request can't be handled because spliced page can't be >> written to, and >> +extending splice for ublk zero copy isn't one good solution >> [#splice_extend]_ >> + >> +- it is very hard to meet above requirements wrt. request buffer >> lifetime >> + >> +splice/pipe focuses on page reference lifetime, but ublk zero copy >> pays more >> +attention to ublk request buffer lifetime. If is very inefficient to >> respect >> +request buffer lifetime by using all pipe buffer's ->release() which >> requires >> +all pipe buffers and pipe to be kept when ublk server handles IO. >> That means >> +one single dedicated ``pipe_inode_info`` has to be allocated runtime >> for each >> +provided buffer, and the pipe needs to be populated with pages in >> ublk request >> +buffer. >> + >> + >> +io_uring fused command based zero copy >> +-------------------------------------- >> + >> +io_uring fused command includes one primary command(uring command) >> and one >> +generic secondary request. The primary command is responsible for >> submitting >> +secondary request with provided buffer from ublk request, and primary >> command >> +won't be completed until the secondary request is completed. >> + >> +Typical ublk IO handling includes network and FS IO, so it is usual >> enough >> +for io_uring net & fs to support IO with provided buffer from primary >> command. >> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A >> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to >> userspace >> -can't be remapped any more in kernel with existing mm interfaces. >> This can >> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported >> that >> -big requests (IO size >= 256 KB) may benefit a lot from zero copy. >> +Once primary command is submitted successfully, ublk driver >> guarantees that >> +the ublk request buffer won't be gone away since secondary request >> actually >> +grabs the buffer's reference. This way also guarantees that multiple >> +concurrent fused commands associated with same request buffer works >> fine, >> +as the provided buffer reference is shared & read-only. >> +Also buffer usage direction flag is passed to primary command from >> userspace, >> +so ublk driver can validate if it is legal to use buffer with requested >> +direction. >> References >> ========== >> @@ -323,4 +431,4 @@ References >> .. [#stefan] >> https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ >> -.. [#xiaoguang] >> https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ >> +.. [#splice_extend] >> https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/ >> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c >> index a744d3b5da91..64b0408873f6 100644 >> --- a/drivers/block/ublk_drv.c >> +++ b/drivers/block/ublk_drv.c >> @@ -74,10 +74,15 @@ struct ublk_rq_data { >> * successfully >> */ >> struct kref ref; >> + bool allocated_bvec; >> + struct io_uring_bvec_buf buf[0]; >> }; >> struct ublk_uring_cmd_pdu { >> - struct ublk_queue *ubq; >> + union { >> + struct ublk_queue *ubq; >> + struct request *req; >> + }; >> }; >> /* >> @@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct >> request *req, >> return done; >> } >> +/* >> + * The built command buffer is immutable, so it is fine to feed it to >> + * concurrent io_uring fused commands >> + */ >> +static int ublk_init_zero_copy_buffer(struct request *rq) >> +{ >> + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); >> + struct io_uring_bvec_buf *imu = data->buf; >> + struct req_iterator rq_iter; >> + unsigned int nr_bvecs = 0; >> + struct bio_vec *bvec; >> + unsigned int offset; >> + struct bio_vec bv; >> + >> + if (!ublk_rq_has_data(rq)) >> + goto exit; >> + >> + rq_for_each_bvec(bv, rq, rq_iter) >> + nr_bvecs++; >> + >> + if (!nr_bvecs) >> + goto exit; >> + >> + if (rq->bio != rq->biotail) { >> + int idx = 0; >> + >> + bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec), >> + GFP_NOIO); >> + if (!bvec) >> + return -ENOMEM; >> + >> + offset = 0; >> + rq_for_each_bvec(bv, rq, rq_iter) >> + bvec[idx++] = bv; >> + data->allocated_bvec = true; >> + } else { >> + struct bio *bio = rq->bio; >> + >> + offset = bio->bi_iter.bi_bvec_done; >> + bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); >> + } >> + imu->bvec = bvec; >> + imu->nr_bvecs = nr_bvecs; >> + imu->offset = offset; >> + imu->len = blk_rq_bytes(rq); >> + >> + return 0; >> +exit: >> + imu->bvec = NULL; >> + return 0; >> +} >> + >> +static void ublk_deinit_zero_copy_buffer(struct request *rq) >> +{ >> + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq); >> + struct io_uring_bvec_buf *imu = data->buf; >> + >> + if (data->allocated_bvec) { >> + kvfree(imu->bvec); >> + data->allocated_bvec = false; >> + } >> +} >> + >> static inline bool ublk_need_map_req(const struct request *req) >> { >> return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE; >> @@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const >> struct request *req) >> return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ; >> } >> -static int ublk_map_io(const struct ublk_queue *ubq, const struct >> request *req, >> +static int ublk_map_io(const struct ublk_queue *ubq, struct request >> *req, >> struct ublk_io *io) >> { >> const unsigned int rq_bytes = blk_rq_bytes(req); >> + if (ublk_support_zc(ubq)) { >> + int ret = ublk_init_zero_copy_buffer(req); >> + >> + /* >> + * The only failure is -ENOMEM for allocating fused cmd >> + * buffer, return zero so that we can requeue this req. >> + */ >> + if (unlikely(ret)) >> + return 0; >> + return rq_bytes; >> + } >> + >> /* >> * no zero copy, we delay copy WRITE request data into ublksrv >> * context and the big benefit is that pinning pages in current >> @@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue >> *ubq, const struct request *req, >> } >> static int ublk_unmap_io(const struct ublk_queue *ubq, >> - const struct request *req, >> + struct request *req, >> struct ublk_io *io) >> { >> const unsigned int rq_bytes = blk_rq_bytes(req); >> + if (ublk_support_zc(ubq)) { >> + ublk_deinit_zero_copy_buffer(req); >> + >> + return rq_bytes; >> + } >> + >> if (ublk_need_unmap_req(req)) { >> struct iov_iter iter; >> struct iovec iov; >> @@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu >> *ublk_get_uring_cmd_pdu( >> return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu; >> } >> +static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu( >> + struct io_uring_cmd *ioucmd) >> +{ >> + return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu; >> +} >> + >> static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) >> { >> return ubq->ubq_daemon->flags & PF_EXITING; >> @@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct >> request *req) >> return; >> exit: >> + ublk_deinit_zero_copy_buffer(req); >> blk_mq_end_request(req, res); >> } >> @@ -1352,6 +1445,68 @@ static inline struct request >> *__ublk_check_and_get_req(struct ublk_device *ub, >> return NULL; >> } >> +static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd, >> + unsigned issue_flags) >> +{ >> + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd); >> + struct request *req = pdu->req; >> + struct ublk_queue *ubq = req->mq_hctx->driver_data; >> + >> + ublk_put_req_ref(ubq, req); >> + io_uring_cmd_done(cmd, 0, 0, issue_flags); >> +} >> + >> +static inline bool ublk_check_fused_buf_dir(const struct request *req, >> + unsigned int flags) >> +{ >> + flags &= IO_URING_F_FUSED_BUF; > > ~IO_URING_F_FUSED_BUF ? > Ah sorry for the noise, got confused because the caller ublk_handle_fused_cmd checks for the flag with '&' and then ublk_check_fused_buf_dir checks with '=' - I thought intend was to remove the flag.