[LSF/MM/BPF TOPIC] ublk & io_uring: ublk zero copy support

Ming Lei <ming.lei@xxxxxxxxxx> · Sat, 29 Apr 2023 10:18:47 +0800

Hello,

ublk zero copy has been observed to improve large-chunk (64KB+) sequential IO
performance a lot. For example, IOPS of ublk-loop over tmpfs increases by 1~2X[1],
and Jens also observed that IOPS of ublk-qcow2 can increase by ~1X[2]. It also
saves memory bandwidth.

So this is an important performance improvement.

So far there are three proposals:

1) splice based

- spliced pages from ->splice_read() can't be written to

A ublk READ request can't be handled because spliced pages can't be written
to, and extending splice for ublk zero copy isn't a good solution[3].

- it is very hard to meet the requirements wrt. request buffer lifetime

splice/pipe focuses on page reference lifetime, but ublk zero copy cares more
about ublk request buffer lifetime. It is very inefficient to respect request
buffer lifetime through every pipe buffer's ->release(), because that requires
all pipe buffers and the pipe itself to be kept while the ublk server handles
the IO. That means one dedicated ``pipe_inode_info`` has to be allocated at
runtime for each provided buffer, and the pipe needs to be populated with the
pages of the ublk request buffer.

IMO, splice isn't a good approach from either the correctness or the
performance viewpoint.

2) io_uring register buffer based

- the main idea is to register a buffer at runtime in the fast IO path, and
  unregister it after the buffer has been used by the following OPs

- the main problem is bad performance caused by the io_uring link model

Registering the buffer has to be one OP, and so does unregistering it; the
following normal OPs (such as FS IO) depend on the buffer-registering OP, so
io_uring link has to be used.

It is normal to have more than one normal OP depending on the
buffer-registering OP, so all these OPs (registering the buffer, the normal
(FS IO) OPs and unregistering the buffer) have to be linked together. The
normal (FS IO) OPs are then submitted one by one, which is slow, because
there is often no dependency among these normal FS OPs themselves. Basically
the io_uring link model does not support this kind of 1:N dependency.
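
To make the cost of the missing 1:N dependency concrete, here is a toy model
(plain C, not kernel or liburing code). It assumes every OP takes one "step"
and may start only after all OPs it depends on have finished; struct op,
critical_path(), linked_chain_steps() and fan_out_steps() are hypothetical
names made up for this sketch.

```c
#include <assert.h>

#define MAX_OPS 16

/* One OP in the toy model; deps[] holds indices of earlier OPs. */
struct op {
	int nr_deps;
	int deps[MAX_OPS];
};

/* Number of sequential steps needed, i.e. the longest dependency path. */
static int critical_path(const struct op *ops, int nr_ops)
{
	int finish[MAX_OPS] = {0};
	int longest = 0;

	assert(nr_ops >= 0 && nr_ops <= MAX_OPS);
	for (int i = 0; i < nr_ops; i++) {
		int start = 0;

		for (int d = 0; d < ops[i].nr_deps; d++) {
			int dep = ops[i].deps[d];

			assert(dep >= 0 && dep < i);
			if (finish[dep] > start)
				start = finish[dep];
		}
		finish[i] = start + 1;
		if (finish[i] > longest)
			longest = finish[i];
	}
	return longest;
}

/* Link model: register -> FS OP 1 -> ... -> FS OP N -> unregister. */
static int linked_chain_steps(int nr_fs_ops)
{
	struct op ops[MAX_OPS] = {0};
	int nr_ops = nr_fs_ops + 2;

	assert(nr_fs_ops >= 0 && nr_ops <= MAX_OPS);
	for (int i = 1; i < nr_ops; i++) {
		ops[i].nr_deps = 1;
		ops[i].deps[0] = i - 1;
	}
	return critical_path(ops, nr_ops);
}

/* 1:N model: every FS OP depends only on register; unregister waits for all. */
static int fan_out_steps(int nr_fs_ops)
{
	struct op ops[MAX_OPS] = {0};
	int nr_ops = nr_fs_ops + 2;
	int last = nr_ops - 1;

	assert(nr_fs_ops >= 0 && nr_ops <= MAX_OPS);
	for (int i = 1; i < last; i++) {
		ops[i].nr_deps = 1;
		ops[i].deps[0] = 0;
		ops[last].deps[ops[last].nr_deps++] = i;
	}
	if (nr_fs_ops == 0) {
		ops[last].nr_deps = 1;
		ops[last].deps[0] = 0;
	}
	return critical_path(ops, nr_ops);
}
```

With four FS OPs, linked_chain_steps(4) gives 6 steps while fan_out_steps(4)
gives 3. The gap grows linearly with the number of FS OPs, which is the
slowdown described above. Real OPs are of course not unit cost, so this only
illustrates the shape of the problem.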

No one has posted code showing this approach yet.

3) io_uring fused command[1]

- fused command extends current io_uring usage by allowing the following FS
OPs (called secondary OPs) to be submitted after the primary command provides
the buffer; the primary command won't be completed until all secondary OPs
are done.

This solves the problem in 2), and it also avoids the buffer registration cost
in both the submission and completion fast paths: because the primary command
won't be completed until all secondary OPs are done, there is no need to write
the buffer into, or read it from, a per-context global data structure.

The buffer lifetime problem is also addressed simply, so correctness is
guaranteed. Performance is pretty good: even IOPS of 4k IO improves a little
in some workloads, and at the least no perf regression is observed for
small-size IO.

A fused command can be thought of logically as one single request that just
has more than one SQE (all sharing the same link flag), which is why it is
named fused command.
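
The completion rule described above (the primary command stays in flight, and
so keeps its buffer valid, until every secondary OP is done) can be sketched
as a simple counter. This is a minimal userspace model, not code from the
patchset[1]; struct fused_cmd and both helpers are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one fused command: a primary plus N secondary OPs. */
struct fused_cmd {
	int pending_secondaries;	/* secondary OPs not yet done */
	bool buffer_valid;		/* buffer provided by the primary */
	bool completed;			/* primary command completed */
};

/* Complete the primary; the provided buffer goes away with it. */
static void fused_cmd_complete(struct fused_cmd *cmd)
{
	cmd->buffer_valid = false;
	cmd->completed = true;
}

/* The primary provides the buffer and waits for its secondary OPs. */
static void fused_cmd_start(struct fused_cmd *cmd, int nr_secondaries)
{
	assert(nr_secondaries >= 0);
	cmd->pending_secondaries = nr_secondaries;
	cmd->buffer_valid = true;
	cmd->completed = false;
	if (nr_secondaries == 0)
		fused_cmd_complete(cmd);
}

/* Called once per finished secondary OP; the last one completes the primary. */
static void fused_secondary_done(struct fused_cmd *cmd)
{
	assert(cmd->pending_secondaries > 0);
	assert(cmd->buffer_valid);
	if (--cmd->pending_secondaries == 0)
		fused_cmd_complete(cmd);
}
```

In this model the buffer cannot be released while any secondary OP still uses
it, and the secondary OPs do not have to finish in any particular order.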

- the only concern is that fused command starts one new usage of io_uring, but
I have not yet seen comments on what is bad about this kind of new
usage/interface, or why.

I propose this topic to discuss how to move forward with this feature.

[1] https://lore.kernel.org/linux-block/20230330113630.1388860-1-ming.lei@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@xxxxxxxxx/
[3] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/

Thanks,
Ming