ublk document is missed when merging ublk driver, so add it now. Cc: Jonathan Corbet <corbet@xxxxxxx> Cc: Richard W.M. Jones <rjones@xxxxxxxxxx> Cc: ZiyangZhang <ZiyangZhang@xxxxxxxxxxxxxxxxx> Cc: Stefan Hajnoczi <stefanha@xxxxxxxxxx> Cc: Xiaoguang Wang <xiaoguang.wang@xxxxxxxxxxxxxxxxx> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> --- Documentation/block/ublk.rst | 203 +++++++++++++++++++++++++++++++++++ 1 file changed, 203 insertions(+) create mode 100644 Documentation/block/ublk.rst diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst new file mode 100644 index 000000000000..9e8f7ba518a3 --- /dev/null +++ b/Documentation/block/ublk.rst @@ -0,0 +1,203 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +Userspace block device driver(ublk driver) +========================================== + +Overview +======== + +ublk is one generic framework for implementing block device logic from +userspace. It is very helpful to move virtual block drivers into userspace, +such as loop, nbd and similar block drivers. It can help to implement new +virtual block device, such as ublk-qcow2, and there was several attempts +of implementing qcow2 driver in kernel. + +ublk block device(``/dev/ublkb*``) is added by ublk driver. Any IO request +submitted to ublk device will be forwarded to ublk's userspace part( +ublksrv [1]), and after the IO is handled by ublksrv, the result is +committed back to ublk driver, then ublk IO request can be completed. With +this way, any specific IO handling logic is totally done inside ublksrv, +and ublk driver doe _not_ handle any device specific IO logic, such as +loop's IO handling, NBD's IO communication, or qcow2's IO mapping, ... + +/dev/ublkbN is driven by blk-mq request based driver, each request is +assigned by one queue wide unique tag. ublksrv assigns unique tag to each +IO too, which is 1:1 mapped with IO of /dev/ublkb*. + +Both the IO request forward and IO handling result committing are done via +io_uring passthrough command, that is why ublk is also one io_uring based +block driver. It has been observed that io_uring passthrough command can get +better IOPS than block IO. So ublk is one high performance implementation +of userspace block device. Not only IO request communication is done by +io_uring, but also the preferred IO handling in ublksrv is io_uring based +approach too. + +ublk provides control interface to set/get ublk block device parameters, and +the interface is extendable and kabi compatible, so basically any ublk request +queue's parameter or ublk generic feature parameters can be set/get via this +extendable interface. So ublk is generic userspace block device framework, such +as, it is easy to setup one ublk device with specified block parameters from +userspace. + +How to use ublk +=============== + +After building ublksrv[1], ublk block device(``/dev/ublkb*``) can be added +and deleted by the utility, then existed block IO applications can talk with +it. + +See usage details in README[2] of ublksrv, for example of ublk-loop: + +- add ublk device: + ublk add -t loop -f ublk-loop.img + +- use it: + mkfs.xfs /dev/ublkb0 + mount /dev/ublkb0 /mnt + .... # all IOs are handled by io_uring!!! + umount /mnt + +- get ublk dev info: + ublk list + +- delete ublk device + ublk del -a + ublk del -n $ublk_dev_id + +Design +====== + +Control plane +------------- + +ublk driver provides global misc device node(``/dev/ublk-control``) for +managing and controlling ublk devices with help of several control commands: + +- UBLK_CMD_ADD_DEV + Add one ublk char device(``/dev/ublkc*``) which is talked with ublksrv wrt. + IO command communication. Basic device info is sent together with this + command, see UAPI structure of ublksrv_ctrl_dev_info, such as nr_hw_queues, + queue_depth, and max IO request buffer size, which info is negotiated with + ublk driver and sent back to ublksrv. After this command is completed, the + basic device info can't be changed any more. + +- UBLK_CMD_SET_PARAMS / UBLK_CMD_GET_PARAMS + Set or get ublk device's parameters, which can be generic feature related, + or request queue limit related, but can't be IO logic specific, cause ublk + driver does not handle any IO logic. This command has to be sent before + sending UBLK_CMD_START_DEV. + +- UBLK_CMD_START_DEV + After ublksrv prepares userspace resource such as, creating per-queue + pthread & io_ruing for handling ublk IO, this command is set for ublk + driver to allocate & expose /dev/ublkb*. Parameters set via + UBLK_CMD_SET_PARAMS are applied for creating /dev/ublkb*. + +- UBLK_CMD_STOP_DEV + Quiesce IO on /dev/ublkb* and delete the disk. After this command returns, + ublksrv can release resource, such as destroy per-queue pthread & io_uring + for handling io command. + +- UBLK_CMD_DEL_DEV + Delete /dev/ublkc*. After this command returns, the allocated ublk device + number can be reused. + +- UBLK_CMD_GET_QUEUE_AFFINITY + After /dev/ublkc is added, ublk driver creates block layer tagset, so each + queue's affinity info is available, ublksrv sends UBLK_CMD_GET_QUEUE_AFFINITY + to retrieve queue affinity info, so ublksrv can setup the per-queue context + efficiently, such as bind affine CPUs with IO pthread, and try to allocate + buffers in IO thread context. + +- UBLK_CMD_GET_DEV_INFO + For retrieve device info of ublksrv_ctrl_dev_info. And it is ublksrv's + responsibility to save IO target specific info in userspace. + +Data plane +---------- + +ublksrv needs to create per-queue IO pthread & io_uring for handling IO +command (io_uring passthrough command), and the per-queue IO pthread +focuses on IO handling and shouldn't handle any control & management +task. + +ublksrv's IO is assigned by one unique tag, which is 1:1 mapping with IO +request of /dev/ublkb*. + +UAPI structure of ublksrv_io_desc is defined for describing each IO from +ublk driver. One fixed mmaped area(array) on /dev/ublkc* is provided for +exporting IO info to ublksrv, such as IO offset, length, OP/flags and +buffer address. Each ublksrv_io_desc instance can be indexed via queue id +and IO tag directly. + +Following IO commands are communicated via io_uring passthrough command, +and each command is only for forwarding ublk IO and committing IO result +with specified IO tag in the command data: + +- UBLK_IO_FETCH_REQ + Sent from ublksrv IO pthread for fetching future coming IO request + issued to /dev/ublkb*. This command is just sent once from ublksrv IO + pthread for ublk driver to setup IO forward environment. + +- UBLK_IO_COMMIT_AND_FETCH_REQ + After one IO request is issued to /dev/ublkb*, ublk driver stores this + IO's ublksrv_io_desc to the specified mapped area, then the previous + received IO command of this IO tag, either UBLK_IO_FETCH_REQ or + UBLK_IO_COMMIT_AND_FETCH_REQ, is completed, so ulksrv gets the IO + notification via io_uring. + + After ublksrv handles this IO, this IO's result is committed back to ublk + driver by sending UBLK_IO_COMMIT_AND_FETCH_REQ back. Once ublkdrv received + this command, it parses the IO result and complete the IO request to + /dev/ublkb*. Meantime setup environment for fetching future IO request + with this IO tag. So UBLK_IO_COMMIT_AND_FETCH_REQ is reused for both + fetching request and committing back IO result. + +- UBLK_IO_NEED_GET_DATA + ublksrv pre-allocates IO buffer for each IO at default, any new project + should use this IO buffer to communicate with ublk driver. But existed + project may not work or be changed to in this way, so add this command + to provide chance for userspace to use its existed buffer for handling + IO. + +- data copy between ublkserv IO buffer and ublk block IO request + ublk driver needs to copy ublk block IO request pages into ublksrv buffer + (pages) first for WRITE before notifying ublksrv of the coming IO, so + ublksrv can hanldle WRITE request. + + After ublksrv handles READ request and sends UBLK_IO_COMMIT_AND_FETCH_REQ + to ublksrv, ublkdrv needs to copy read ublksrv buffer(pages) to the ublk + IO request pages. + +Future development +================== + +Container-ware ublk deivice +--------------------------- + +ublk driver doesn't handle any IO logic, and its function is well defined +so far, and very limited userspace interfaces are needed, and each one is +well defined too, then it is very likely to make ublk device one +container-ware block device in future, as Stefan Hajnoczi suggested[3], by +removing ADMIN privilege. + +Zero copy +--------- + +Wrt. zero copy support, it is one generic requirement for nbd, fuse or +similar drivers, one problem Xiaoguang mentioned is that pages mapped to +userspace can't be remapped any more in kernel with existed mm interfaces, +and it can be involved when submitting direct IO to /dev/ublkb*. Also +Xiaoguang reported that big request may benefit from zero copy a lot, +such as >= 256KB IO. + + +References +========== + +[1] https://github.com/ming1/ubdsrv + +[2] https://github.com/ming1/ubdsrv/blob/master/README + +[3] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ -- 2.31.1