Re: [PATCH V3 0/2] ublk: add support for UBLK_IO_NEED_GET_DATA

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 28 Jul 2022 18:51:13 +0800

On Thu, Jul 28, 2022 at 05:31:22PM +0800, ZiyangZhang wrote:
> 1. Introduction:
> UBLK_IO_NEED_GET_DATA is a new ublk IO command. It is designed for a user
> application who wants to allocate IO buffer and set IO buffer address
> only after it receives an IO request from ublksrv. This is a reasonable
> scenario because these users may use a RPC framework as one IO backend
> to handle IO requests passed from ublksrv. And a RPC framework may
> allocate its own buffer(or memory pool).
> 
> This new feature (UBLK_F_NEED_GET_DATA) is optional for ublk users.
> Related userspace code has been added in ublksrv[1] as one pull request.
> 
> We have add some test cases in ublksrv and all of them pass. The
> performance result shows that this new feature does bring additional
> latency because one IO is issued back to ublk_drv once again to copy data
> from bio vectors to user-provided data buffer.
> 
> 2. Background:
> For now, ublk requires the user to set IO buffer address in advance(with
> last UBLK_IO_COMMIT_AND_FETCH_REQ command)so the user has to
> pre-allocate IO buffer.
> 
> For READ requests, this work flow looks good because the data copy
> happens after user application gets a cqe and the kernel copies data.
> So user application can allocate IO buffer, copy data to be read into
> it, and issues a sqe with the newly allocated IO buffer.
> 
> However, for WRITE requests, this work flow looks weird because
> the data copy happens in a task_work before the user application gets one
> cqe. So it is inconvenient for users who allocates(or fetch from a
> memory pool)buffer after it gets one request(and know the actual data
> size). For these users, they have to memcpy from ublksrv's pre-allocated
> buffer to their internal buffer(such as RPC buffer). We think this
> additional memcpy could be a bottleneck and it is avoidable.
> 
> 2. Design:
> Consider add a new feature flag: UBLK_F_NEED_GET_DATA.
> 
> If user sets this new flag(through libublksrv) and pass it to kernel
> driver, ublk kernel driver should returns a cqe with
> UBLK_IO_RES_NEED_GET_DATA after a new blk-mq WRITE request comes.
> 
> A user application now can allocate data buffer for writing and pass its
> address in UBLK_IO_NEED_GET_DATA command after ublk kernel driver returns
> cqe with UBLK_IO_RES_NEED_GET_DATA.
> 
> After the kernel side gets the sqe (UBLK_IO_NEED_GET_DATA command), it
> now copies(address pinned, indeed) data to be written from bio vectors
> to newly returned IO data buffer.
> 
> Finally, the kernel side returns UBLK_IO_RES_OK and ublksrv handles the
> IO request as usual.The new feature: UBLK_F_NEED_GET_DATA is enabled on
> demand ublksrv still can pre-allocate data buffers with task_work.
> 
> 3. Evaluation:
> Related userspace code and tests have been added in ublksrv[1] as one
> pull request. We evaluate performance based on this PR.
> 
> We have tested write latency with:
>   (1)  No UBLK_F_NEED_GET_DATA(the old commit) as baseline
>   (2)  UBLK_F_NEED_GET_DATA enabled/disabled
> on demo_null and demo_event of newest ublksrv project.
> 
> Config of fio:bs=4k, iodepth=1, numjobs=1, rw=write/randwrite, direct=1,
> ioengine=libaio.
> 
> Here is the comparison of lat(usec) in fio:
> 
> demo_null:
> write:        28.74(baseline) -- 28.77(disable) -- 57.20(enable)
> randwrite:    27.81(baseline) -- 28.51(disable) -- 54.81(enable)
> 
> demo_event:
> write:        46.45(baseline) -- 43.31(disable) -- 75.50(enable)
> randwrite:    45.39(baseline) -- 43.66(disable) -- 76.02(enable)

The data is interesting, and I guess the enable latency data could become
much less if 64 iodepth & 16 batch is used
    --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

> 
> Looks like:
>   (1) UBLK_F_NEED_GET_DATA does not introduce additional overhead when
>       comparing baseline and disable.
>   (2) enabling UBLK_F_NEED_GET_DATA adds about two times more latency
>       than disabling it. And it is reasonable since the IO goes through
>       the total ublk IO stack(ubd_drv <--> ublksrv) once again.
>   (3) demo_null and demo_event are both null targets. And I think this
>       overhead is not too heavy if real data handling backend is used.
> 
> Without UBLK_IO_NEED_GET_DATA, an additional memcpy(from pre-allocated
> ublksrv buffer to user's buffer) is necessary for a WRITE request.
> However, UBLK_IO_NEED_GET_DATA does bring addtional syscall
> (io_uring_enter). To prove the value of UBLK_IO_NEED_GET_DATA, we test
> the single IO latency (usec) of demo_null with:
>   (1) UBLK_F_NEED_GET_DATA disabled; additional memcpy
>   (2) UBLK_F_NEED_GET_DATA enabled
> 
> Config of fio:iodepth=1, numjobs=1, rw=randwrite, direct=1,
> ioengine=libaio.
> 
> For block size, we choose 4k/64k/128k/256k/512k/1m. Note that with 1m block
> size, the original IO request will be split into two blk-mq requests.
> 
> Here is the comparison of lat(usec) in fio:
> 
>                  2 memcpy, w/o NEED_GET_DATA     1 memcpy, w/ NEED_GET_DATA
> 4k-randwrite:               9.65                            10.06
> 64k-randwrite:              15.19                           13.38
> 128k-randwrite:             19.47                           17.77
> 256k-randwrite:             32.63                           25.33
> 512k-randwrite:             90.57                           46.08
> 1m-randwrite:               177.06                          117.26
> 
> We find that with bigger block size, cases with one memcpy w/ NEED_GET_DATA
> result in lower latency than cases with two memcpy w/o NEED_GET_DATA.
> Therefore, we think NEED_GET_DATA is suitable for bigger block size,
> such as 512B or 1MB.

With 64 iodepth and submit/completion batching, I think NEED_GET_DATA
could become less.

Anyway, it is very helpful to share the test data, nice job!

Thanks,
Ming