Hi Ming,

On 4/29/23 04:18, Ming Lei wrote:
> Hello,
>
> ublk zero copy is observed to improve big-chunk (64KB+) sequential IO
> performance a lot; for example, IOPS of ublk-loop over tmpfs is
> increased by 1-2X [1], and Jens also observed that IOPS of ublk-qcow2
> can be increased by ~1X [2]. Meanwhile it saves memory bandwidth.
>
> So this is one important performance improvement.
>
> So far there are three proposals:

Looks like there is no dedicated session. Could we still have a
discussion in a free slot, if possible?

Thanks,
Bernd

> 1) splice based
>
> - spliced pages from ->splice_read() can't be written to
>
>   ublk READ requests can't be handled because spliced pages can't be
>   written to, and extending splice for ublk zero copy isn't a good
>   solution [3] (a userspace sketch of this limitation follows below)
>
> - it is very hard to meet the above requirements wrt. request buffer
>   lifetime
>
>   splice/pipe focuses on page reference lifetime, but ublk zero copy
>   pays more attention to ublk request buffer lifetime. It is very
>   inefficient to respect request buffer lifetime via each pipe
>   buffer's ->release(), which requires all pipe buffers and the pipe
>   itself to be kept alive while the ublk server handles the IO. That
>   means one dedicated ``pipe_inode_info`` has to be allocated at
>   runtime for each provided buffer, and the pipe needs to be
>   populated with the pages of the ublk request buffer.
>
> IMO, splice isn't a good way to go, from both the correctness and the
> performance viewpoint.
>
> 2) io_uring register buffer based
>
> - the main idea is to register one buffer at runtime in the fast IO
>   path, and unregister it after the buffer has been used by the
>   following OPs
>
> - the main problem is bad performance caused by the io_uring link
>   model
>
>   Registering the buffer has to be one OP, and the same goes for
>   unregistering it; the following normal OPs (such as FS IO) have to
>   depend on the register-buffer OP, so io_uring links have to be
>   used.
>
>   It is normal to see more than one normal OP depending on the
>   register-buffer OP, so all these OPs (register buffer, normal
>   (FS IO) OPs and unregister buffer) have to be linked together. The
>   normal (FS IO) OPs then have to be submitted one by one, which is
>   slow, because there is often no dependency among these normal FS
>   OPs. Basically the io_uring link model does not support this kind
>   of 1:N dependency (a sketch of the forced serialization follows
>   below).
>
>   No one has posted code for this approach yet.
>
> 3) io_uring fused command [1]
>
> - fused command extends current io_uring usage by allowing the
>   following FS OPs (called secondary OPs) to be submitted after the
>   primary command provides the buffer, and the primary command won't
>   be completed until all secondary OPs are done.
>
>   This solves the problem in 2), and meanwhile avoids the buffer
>   register/unregister cost in both the submission and completion
>   fast paths: because the primary command isn't completed until all
>   secondary OPs are done, there is no need to write/read the buffer
>   to/from a per-context global data structure.
>
>   Meanwhile the buffer lifetime problem is addressed simply, so
>   correctness is guaranteed, and performance is pretty good; even 4k
>   IOPS improves a little in some workloads, and at least no perf
>   regression is observed for small-size IO.
>
>   A fused command can be thought of as one single request logically,
>   except that it has more than one SQE (all sharing the same link
>   flag), which is why it is named "fused command" (see the SQE
>   layout sketch below).
>
> - the only concern is that fused command introduces a new usage of
>   io_uring, but so far no comments have been seen on what is bad
>   about this kind of new usage/interface, or why.
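To make 1)'s limitation concrete, here is a minimal userspace sketch
using plain splice(2), nothing ublk-specific: pages move from the
source file into a pipe and onwards by reference, and the consumer may
only read them. That is fine for a ublk WRITE (the server reads the
request buffer) but not for a ublk READ (the server must fill it):

    /* file -> pipe -> /dev/null, zero copy via splice(2) */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            int in, out, p[2];
            ssize_t n;

            if (argc != 2)
                    return 1;
            in = open(argv[1], O_RDONLY);
            out = open("/dev/null", O_WRONLY);
            if (in < 0 || out < 0 || pipe(p) < 0) {
                    perror("setup");
                    return 1;
            }
            /* move page references from the file into the pipe ... */
            while ((n = splice(in, NULL, p[1], NULL, 65536, 0)) > 0) {
                    /* ... and from the pipe onwards: no copy is made,
                     * but there is also no way to write into the
                     * spliced pages, which a ublk READ would need */
                    if (splice(p[0], NULL, out, NULL, n, 0) < 0) {
                            perror("splice");
                            return 1;
                    }
            }
            return 0;
    }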
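To illustrate the 1:N problem in 2), below is a liburing sketch in
which io_uring_prep_nop() stands in for the hypothetical register- and
unregister-buffer OPs (no such OPs exist upstream). Because the whole
group must be one link chain, the two reads run strictly one after the
other even though they are independent of each other:

    #include <fcntl.h>
    #include <liburing.h>

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            static char a[4096], b[4096];
            int i, fd = open("/dev/zero", O_RDONLY);

            if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                    return 1;

            /* hypothetical "register buffer" OP, stubbed as a NOP */
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_nop(sqe);
            sqe->flags |= IOSQE_IO_LINK;

            /* normal FS OPs: the link flag serializes them although
             * they don't depend on each other */
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, a, sizeof(a), 0);
            sqe->flags |= IOSQE_IO_LINK;

            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, b, sizeof(b), 0);
            sqe->flags |= IOSQE_IO_LINK;

            /* hypothetical "unregister buffer" OP, last in the chain */
            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_nop(sqe);

            io_uring_submit(&ring);
            for (i = 0; i < 4; i++) {
                    if (io_uring_wait_cqe(&ring, &cqe) < 0)
                            return 1;
                    io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
    }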
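And for 3), a sketch of the fused-command SQE layout. Caveat: the prep
helpers below are NOP stand-ins invented for illustration; they do not
match the actual API of the posted patchset, and on a mainline kernel
this runs as an ordinary link chain rather than with fused semantics:

    #include <liburing.h>

    /* stand-ins for the real fused-command preps from [1] */
    static void prep_primary_cmd(struct io_uring_sqe *sqe)
    {
            io_uring_prep_nop(sqe); /* would provide the ublk buffer */
    }
    static void prep_secondary_op(struct io_uring_sqe *sqe)
    {
            io_uring_prep_nop(sqe); /* would be an FS OP on that buffer */
    }

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            int i;

            if (io_uring_queue_init(8, &ring, 0) < 0)
                    return 1;

            /* primary command: completes only after all secondary OPs,
             * so the ublk request buffer stays valid exactly as long
             * as the secondaries need it */
            sqe = io_uring_get_sqe(&ring);
            prep_primary_cmd(sqe);
            sqe->flags |= IOSQE_IO_LINK;    /* the shared link flag */

            /* secondary OPs: grouped under the primary instead of
             * being serialized behind a register/unregister pair */
            for (i = 0; i < 2; i++) {
                    sqe = io_uring_get_sqe(&ring);
                    prep_secondary_op(sqe);
                    sqe->flags |= IOSQE_IO_LINK;
            }

            io_uring_submit(&ring);
            for (i = 0; i < 3; i++) {
                    if (io_uring_wait_cqe(&ring, &cqe) < 0)
                            return 1;
                    io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
    }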
>
> I propose this topic and want to discuss how to move on with this
> feature.
>
>
> [1] https://lore.kernel.org/linux-block/20230330113630.1388860-1-ming.lei@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@xxxxxxxxx/
> [3] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/
>
>
> Thanks,
> Ming
>