On 3/30/23 13:36, Ming Lei wrote:
> Apply io_uring fused command for supporting zero copy:
>
> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> in ublk_unmap_io(); this buffer is immutable, so it is fine to retrieve it
> from concurrent fused commands.
>
> 2) add sub-command opcode UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> fused cmd(zero copy) buffer
>
> 3) call io_fused_cmd_start_secondary_req() to provide the buffer to the
> secondary request and submit it; meantime set up the complete callback via
> this API, and once the secondary request is completed, the callback is
> called for freeing the buffer and completing the fused command
>
> Also a request reference is held during the fused command lifetime, which
> guarantees that the request buffer won't be freed until all inflight fused
> commands are completed.
>
> userspace (only implements sqe128 fused command):
>
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>
> liburing test (only implements normal sqe fused command: two 64byte SQEs):
>
> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
>  Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
>  drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
>  include/uapi/linux/ublk_cmd.h |   6 +-
>  3 files changed, 303 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 1713b2890abb..7b7aa24e9729 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -297,18 +297,126 @@ with specified IO tag in the command data:
>  ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>  the server buffer (pages) read to the IO request pages.
>
> -Future development
> -==================
> +- ``UBLK_IO_FUSED_SUBMIT_IO``
> +
> +  Used for implementing the zero copy feature.
> +
> +  It has to be the primary command of an io_uring fused command.
> +  This command submits the generic secondary IO request with the io buffer
> +  provided by the primary command, and won't be completed until the
> +  secondary request is done.
> +
> +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
> +  actually a reference to the ublk request buffer, and the reference is
> +  shared & read-only, so the generic secondary request can retrieve any
> +  part of the buffer by passing buffer offset & length.
>
>  Zero copy
> ----------
> +=========
> +
> +What is zero copy?
> +------------------
> +
> +When an application submits IO to ``/dev/ublkb*``, a userspace buffer
> +(direct io), page cache buffer (buffered io) or kernel buffer (often meta
> +io) is used for submitting data to the ublk driver, and all of these
> +buffers are finally represented by bio/bvecs (the ublk request buffer).
> +Before supporting zero copy, data in these buffers has to be copied to the
> +ublk server's userspace buffer before handling WRITE IO, or after handling
> +READ IO, so that the ublk server can handle IO for ``/dev/ublkb*`` with
> +the copied data.
> +
> +The extra copy between the ublk request buffer and the ublk server's
> +userspace buffer not only increases CPU utilization (pinning pages,
> +copying data) but also consumes memory bandwidth, and the cost can be
> +significant for large IOs. It is observed that ublk-null IOPS may increase
> +to ~5X if the extra copy is avoided.
> +
> +So zero copy is very important for supporting a high performance block
> +device in userspace.
> +
> +Technical requirements
> +----------------------
> +
> +- ublk request buffer use
> +
> +The ublk request buffer is represented by bio/bvec, which is immutable, so
> +do not try to change the bvec via the buffer reference; data can be read
> +from or written to the buffer according to the buffer direction, but the
> +bvec itself can't be changed.
> +
> +- buffer lifetime
> +
> +The ublk server borrows the ublk request buffer for handling ublk IO; a
> +reference to the request buffer is used.
> +A reference can't outlive the referent buffer. That means all request
> +buffer references have to be released by the ublk server before the ublk
> +driver completes the request, i.e. before request buffer ownership is
> +transferred to the upper layer (FS, application, ...).
> +
> +Also, after the ublk request is completed, no page belonging to this ublk
> +request may be written or read any more by the ublk server, since it is a
> +block device from the kernel's viewpoint.
> +
> +- buffer direction
> +
> +For a ublk WRITE request, the ublk request buffer should only be accessed
> +as the data source, and the buffer can't be written by the ublk server.
> +
> +For a ublk READ request, the ublk request buffer should only be accessed
> +as the data destination, and the buffer can't be read by the ublk server,
> +otherwise kernel data is leaked to the ublk server, which can be an
> +unprivileged application.
> +
> +- arbitrary size sub-buffer needs to be retrieved from ublk server
> +
> +ublk is a generic framework for implementing block devices in userspace,
> +and typical use cases include logical volume manager (mirror, striped, ...),
> +distributed network storage, compressed target, ...
> +
> +The ublk server needs to retrieve arbitrary size sub-buffers of the ublk
> +request, and to submit IOs with these sub-buffer(s). That also means
> +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> +
> +Any sub-buffer is actually a reference to the ublk request buffer, whose
> +ownership can't be transferred to the upper layer while any reference is
> +held by the ublk server.
> +
> +Why splice isn't good for ublk zero copy
> +----------------------------------------
> +
> +- spliced page from ->splice_read() can't be written
> +
> +The ublk READ request can't be handled because the spliced page can't be
> +written to, and extending splice for ublk zero copy isn't a good
> +solution [#splice_extend]_
> +
> +- it is very hard to meet the above requirements wrt.
> +  request buffer lifetime
> +
> +splice/pipe focuses on page reference lifetime, but ublk zero copy pays
> +more attention to the ublk request buffer lifetime. It is very inefficient
> +to enforce the request buffer lifetime via every pipe buffer's ->release(),
> +which requires all pipe buffers and the pipe itself to be kept around
> +while the ublk server handles the IO. That means a single dedicated
> +``pipe_inode_info`` has to be allocated at runtime for each provided
> +buffer, and the pipe has to be populated with the pages of the ublk
> +request buffer.
> +
> +
> +io_uring fused command based zero copy
> +--------------------------------------
> +
> +An io_uring fused command includes one primary command (uring command) and
> +one generic secondary request. The primary command is responsible for
> +submitting the secondary request with the buffer provided from the ublk
> +request, and the primary command won't be completed until the secondary
> +request is completed.
> +
> +Typical ublk IO handling includes network and FS IO, so it is natural for
> +io_uring net & fs to support IO with a buffer provided by the primary
> +command.
>
> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> -can't be remapped any more in kernel with existing mm interfaces. This can
> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> +Once the primary command is submitted successfully, the ublk driver
> +guarantees that the ublk request buffer won't go away, since the secondary
> +request actually grabs the buffer's reference. This also guarantees that
> +multiple concurrent fused commands associated with the same request buffer
> +work fine, as the provided buffer reference is shared & read-only.
>
> +Also, the buffer usage direction flag is passed to the primary command
> +from userspace, so the ublk driver can validate whether it is legal to
> +use the buffer in the requested direction.
>
>  References
>  ==========
> @@ -323,4 +431,4 @@ References
>
>  .. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
>
> -.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
> +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index a744d3b5da91..64b0408873f6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -74,10 +74,15 @@ struct ublk_rq_data {
>  	 * successfully
>  	 */
>  	struct kref ref;
> +	bool allocated_bvec;
> +	struct io_uring_bvec_buf buf[0];
>  };
>
>  struct ublk_uring_cmd_pdu {
> -	struct ublk_queue *ubq;
> +	union {
> +		struct ublk_queue *ubq;
> +		struct request *req;
> +	};
>  };
>
>  /*
> @@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct request *req,
>  	return done;
>  }
>
> +/*
> + * The built command buffer is immutable, so it is fine to feed it to
> + * concurrent io_uring fused commands
> + */
> +static int ublk_init_zero_copy_buffer(struct request *rq)
> +{
> +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +	struct io_uring_bvec_buf *imu = data->buf;
> +	struct req_iterator rq_iter;
> +	unsigned int nr_bvecs = 0;
> +	struct bio_vec *bvec;
> +	unsigned int offset;
> +	struct bio_vec bv;
> +
> +	if (!ublk_rq_has_data(rq))
> +		goto exit;
> +
> +	rq_for_each_bvec(bv, rq, rq_iter)
> +		nr_bvecs++;
> +
> +	if (!nr_bvecs)
> +		goto exit;
> +
> +	if (rq->bio != rq->biotail) {
> +		int idx = 0;
> +
> +		bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
> +				GFP_NOIO);
> +		if (!bvec)
> +			return -ENOMEM;
> +
> +		offset = 0;
> +		rq_for_each_bvec(bv, rq, rq_iter)
> +			bvec[idx++] = bv;
> +		data->allocated_bvec = true;
> +	} else {
> +		struct bio *bio = rq->bio;
> +
> +		offset = bio->bi_iter.bi_bvec_done;
> +		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> +	}
> +	imu->bvec = bvec;
> +	imu->nr_bvecs = nr_bvecs;
> +	imu->offset = offset;
> +	imu->len = blk_rq_bytes(rq);
> +
> +	return 0;
> +exit:
> +	imu->bvec = NULL;
> +	return 0;
> +}
> +
> +static void ublk_deinit_zero_copy_buffer(struct request *rq)
> +{
> +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +	struct io_uring_bvec_buf *imu = data->buf;
> +
> +	if (data->allocated_bvec) {
> +		kvfree(imu->bvec);
> +		data->allocated_bvec = false;
> +	}
> +}
> +
>  static inline bool ublk_need_map_req(const struct request *req)
>  {
>  	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
> @@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const struct request *req)
>  	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
>  }
>
> -static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
> +static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
>  		struct ublk_io *io)
>  {
>  	const unsigned int rq_bytes = blk_rq_bytes(req);
>
> +	if (ublk_support_zc(ubq)) {
> +		int ret = ublk_init_zero_copy_buffer(req);
> +
> +		/*
> +		 * The only failure is -ENOMEM for allocating the fused
> +		 * cmd buffer; return zero so that we can requeue this req.
> +		 */
> +		if (unlikely(ret))
> +			return 0;
> +		return rq_bytes;
> +	}
> +
>  	/*
>  	 * no zero copy, we delay copy WRITE request data into ublksrv
>  	 * context and the big benefit is that pinning pages in current
> @@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
>  }
>
>  static int ublk_unmap_io(const struct ublk_queue *ubq,
> -		const struct request *req,
> +		struct request *req,
>  		struct ublk_io *io)
>  {
>  	const unsigned int rq_bytes = blk_rq_bytes(req);
>
> +	if (ublk_support_zc(ubq)) {
> +		ublk_deinit_zero_copy_buffer(req);
> +
> +		return rq_bytes;
> +	}
> +
>  	if (ublk_need_unmap_req(req)) {
>  		struct iov_iter iter;
>  		struct iovec iov;
> @@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>  	return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
>  }
>
> +static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
> +	struct io_uring_cmd *ioucmd)
> +{
> +	return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
> +}
> +
>  static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
>  {
>  	return ubq->ubq_daemon->flags & PF_EXITING;
> @@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct request *req)
>
>  	return;
>  exit:
> +	ublk_deinit_zero_copy_buffer(req);
>  	blk_mq_end_request(req, res);
>  }
>
> @@ -1352,6 +1445,68 @@ static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
>  	return NULL;
>  }
>
> +static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
> +		unsigned issue_flags)
> +{
> +	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
> +	struct request *req = pdu->req;
> +	struct ublk_queue *ubq = req->mq_hctx->driver_data;
> +
> +	ublk_put_req_ref(ubq, req);
> +	io_uring_cmd_done(cmd, 0, 0, issue_flags);
> +}
> +
> +static inline bool ublk_check_fused_buf_dir(const struct request *req,
> +		unsigned int flags)
> +{
> +	flags &= IO_URING_F_FUSED_BUF;

~IO_URING_F_FUSED_BUF ?