Hi Ming,

On 4/29/23 04:18, Ming Lei wrote:
> Hello,
>
> ublk zero copy is observed to improve big-chunk (64KB+) sequential IO
> performance a lot; for example, IOPS of ublk-loop over tmpfs is
> increased by 1-2X [1], and Jens also observed that IOPS of ublk-qcow2
> can be increased by ~1X [2]. Meanwhile it saves memory bandwidth.
>
> So this is one important performance improvement.
>
> So far there are three proposals:

Looks like there is no dedicated session. Could we still have a
discussion in a free slot, if possible?

Thanks,
Bernd

> 1) splice based
>
> - spliced pages from ->splice_read() can't be written to
>
>   ublk READ requests can't be handled because spliced pages can't be
>   written to, and extending splice for ublk zero copy isn't a good
>   solution [3] (a userspace sketch of this limitation follows below)
>
> - it is very hard to meet the above requirements wrt. request buffer
>   lifetime
>
>   splice/pipe focuses on page reference lifetime, but ublk zero copy
>   pays more attention to ublk request buffer lifetime. It is very
>   inefficient to respect request buffer lifetime via each pipe
>   buffer's ->release(), which requires all pipe buffers and the pipe
>   itself to be kept alive while the ublk server handles the IO. That
>   means one dedicated ``pipe_inode_info`` has to be allocated at
>   runtime for each provided buffer, and the pipe needs to be
>   populated with the pages of the ublk request buffer.
>
> IMO, splice isn't a good way to go, from both the correctness and the
> performance viewpoint.
>
> 2) io_uring register buffer based
>
> - the main idea is to register one buffer at runtime in the fast IO
>   path, and unregister it after the buffer has been used by the
>   following OPs
>
> - the main problem is bad performance caused by the io_uring link
>   model
>
>   Registering the buffer has to be one OP, and the same goes for
>   unregistering it; the following normal OPs (such as FS IO) have to
>   depend on the register-buffer OP, so io_uring links have to be
>   used.
>
>   It is normal to see more than one normal OP depending on the
>   register-buffer OP, so all these OPs (register buffer, normal
>   (FS IO) OPs and unregister buffer) have to be linked together. The
>   normal (FS IO) OPs then have to be submitted one by one, which is
>   slow, because there is often no dependency among these normal FS
>   OPs. Basically the io_uring link model does not support this kind
>   of 1:N dependency (a sketch of the forced serialization follows
>   below).
>
>   No one has posted code for this approach yet.
>
> 3) io_uring fused command [1]
>
> - fused command extends current io_uring usage by allowing the
>   following FS OPs (called secondary OPs) to be submitted after the
>   primary command provides the buffer, and the primary command won't
>   be completed until all secondary OPs are done.
>
>   This solves the problem in 2), and meanwhile avoids the buffer
>   register/unregister cost in both the submission and completion
>   fast paths: because the primary command isn't completed until all
>   secondary OPs are done, there is no need to write/read the buffer
>   to/from a per-context global data structure.
>
>   Meanwhile the buffer lifetime problem is addressed simply, so
>   correctness is guaranteed, and performance is pretty good; even 4k
>   IOPS improves a little in some workloads, and at least no perf
>   regression is observed for small-size IO.
>
>   A fused command can be thought of as one single request logically,
>   except that it has more than one SQE (all sharing the same link
>   flag), which is why it is named "fused command" (see the SQE
>   layout sketch below).
>
> - the only concern is that fused command introduces a new usage of
>   io_uring, but so far no comments have been seen on what is bad
>   about this kind of new usage/interface, or why.
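To make 1)'s limitation concrete, here is a minimal userspace sketch
using plain splice(2), nothing ublk-specific: pages move from the
source file into a pipe and onwards by reference, and the consumer may
only read them. That is fine for a ublk WRITE (the server reads the
request buffer) but not for a ublk READ (the server must fill it):

    /* file -> pipe -> /dev/null, zero copy via splice(2) */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            int in, out, p[2];
            ssize_t n;

            if (argc != 2)
                    return 1;
            in = open(argv[1], O_RDONLY);
            out = open("/dev/null", O_WRONLY);
            if (in < 0 || out < 0 || pipe(p) < 0) {
                    perror("setup");
                    return 1;
            }
            /* move page references from the file into the pipe ... */
            while ((n = splice(in, NULL, p[1], NULL, 65536, 0)) > 0) {
                    /* ... and from the pipe onwards: no copy is made,
                     * but there is also no way to write into the
                     * spliced pages, which a ublk READ would need */
                    if (splice(p[0], NULL, out, NULL, n, 0) < 0) {
                            perror("splice");
                            return 1;
                    }
            }
            return 0;
    }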
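To illustrate the 1:N problem in 2), below is a liburing sketch in
which io_uring_prep_nop() stands in for the hypothetical register- and
unregister-buffer OPs (no such OPs exist upstream). Because the whole
group must be one link chain, the two reads run strictly one after the
other even though they are independent of each other:

    #include <fcntl.h>
    #include <liburing.h>

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            static char a[4096], b[4096];
            int i, fd = open("/dev/zero", O_RDONLY);

            if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                    return 1;

            /* hypothetical "register buffer" OP, stubbed as a NOP */
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_nop(sqe);
            sqe->flags |= IOSQE_IO_LINK;

            /* normal FS OPs: the link flag serializes them although
             * they don't depend on each other */
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, a, sizeof(a), 0);
            sqe->flags |= IOSQE_IO_LINK;

            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, b, sizeof(b), 0);
            sqe->flags |= IOSQE_IO_LINK;

            /* hypothetical "unregister buffer" OP, last in the chain */
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_nop(sqe);

            io_uring_submit(&ring);
            for (i = 0; i < 4; i++) {
                    if (io_uring_wait_cqe(&ring, &cqe) < 0)
                            return 1;
                    io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
    }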
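And for 3), a sketch of the fused-command SQE layout. Caveat: the prep
helpers below are NOP stand-ins invented for illustration; they do not
match the actual API of the posted patchset, and on a mainline kernel
this runs as an ordinary link chain rather than with fused semantics:

    #include <liburing.h>

    /* stand-ins for the real fused-command preps from [1] */
    static void prep_primary_cmd(struct io_uring_sqe *sqe)
    {
            io_uring_prep_nop(sqe); /* would provide the ublk buffer */
    }
    static void prep_secondary_op(struct io_uring_sqe *sqe)
    {
            io_uring_prep_nop(sqe); /* would be an FS OP on that buffer */
    }

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            int i;

            if (io_uring_queue_init(8, &ring, 0) < 0)
                    return 1;

            /* primary command: completes only after all secondary OPs,
             * so the ublk request buffer stays valid exactly as long
             * as the secondaries need it */
            sqe = io_uring_get_sqe(&ring);
            prep_primary_cmd(sqe);
            sqe->flags |= IOSQE_IO_LINK;    /* the shared link flag */

            /* secondary OPs: grouped under the primary instead of
             * being serialized behind a register/unregister pair */
            for (i = 0; i < 2; i++) {
                    sqe = io_uring_get_sqe(&ring);
                    prep_secondary_op(sqe);
                    sqe->flags |= IOSQE_IO_LINK;
            }

            io_uring_submit(&ring);
            for (i = 0; i < 3; i++) {
                    if (io_uring_wait_cqe(&ring, &cqe) < 0)
                            return 1;
                    io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
    }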
>
> I propose this topic and want to discuss how to move on with this
> feature.
>
>
> [1] https://lore.kernel.org/linux-block/20230330113630.1388860-1-ming.lei@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@xxxxxxxxx/
> [3] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/
>
>
> Thanks,
> Ming
>