On Wed, Oct 16, 2024 at 01:53:00PM +0200, Bernd Schubert wrote:
>
>
> On 10/16/24 12:54, Miklos Szeredi wrote:
> > On Mon, 14 Oct 2024 at 23:27, Bernd Schubert <bernd.schubert@xxxxxxxxxxx> wrote:
> >
> >> With only libfuse as ring user it is more like
> >>
> >> prep_requests(nr=N);
> >> wait_cq(1); ==> we must not wait for more than 1 as more might never arrive
> >> io_uring_for_each_cqe {
> >> }
> >
> > Right.
> >
> > I think the point Pavel is trying to make is that io_uring queue
> > sizes don't have to match the fuse queue size. So we could have
> > sq_entries=4, cq_entries=4 and have the server queue 64
> > FUSE_URING_REQ_FETCH commands, it just has to do that in batches of
> > 4 max.
>
> Hmm, OK. I guess that might matter when the payload is small compared
> to the SQ/CQ size and the system is low on memory.
>
> >
> >> @Miklos maybe we avoid using large CQEs/SQEs and instead set up our own
> >> separate buffer for FUSE headers?
> >
> > The only gain from this would be in the case where the uring is used
> > for non-fuse requests as well, in which case the extra space in the
> > queue entries would be unused (i.e. 48 unused bytes in the cacheline).
> > I don't know if this is a realistic use case or not. It's definitely
> > a challenge to create a library API that allows this.
> >
> > The disadvantage would be a more complex interface.
>
> I don't think it's that complicated. In the end it is just another
> pointer that needs to be mapped; we don't even need to use mmap.
> At least for zero-copy we will need the ring for non-fuse requests
> anyway. For the DDN use case, we are using another io_uring for tcp
> requests; I would actually like to switch that to the same ring.

I remember the biggest trouble with using the same ring in ublk was
exporting the ring to API users, but since it is usually per-task, that
seems not too hard to deal with.

The upside is that you no longer need an eventfd to communicate with
the fuse command uring (thread), and more uring IOs can be handled in a
single batch.

Performance is better, with fewer task switches involved and no extra
communication.

Thanks,
Ming
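
For reference, below is a minimal liburing sketch of the batching
pattern Miklos describes (a 4-entry SQ/CQ feeding 64 pending
FUSE_URING_REQ_FETCH commands) together with Bernd's wait_cq(1) loop.
It assumes the uapi definitions from the series under review
(FUSE_URING_REQ_FETCH) are available; the fuse_uring_prep_fetch()
helper and its SQE encoding are hypothetical stand-ins for whatever
the patchset actually defines:

	#include <liburing.h>
	#include <string.h>

	#define FUSE_QUEUE_DEPTH 64	/* FETCH commands kept in flight */
	#define RING_ENTRIES	  4	/* sq_entries == cq_entries == 4 */

	/* Hypothetical helper: fill an IORING_OP_URING_CMD SQE with
	 * FUSE_URING_REQ_FETCH; the real cmd layout is whatever the
	 * fuse-over-io-uring series defines. */
	static void fuse_uring_prep_fetch(struct io_uring_sqe *sqe,
					  int dev_fd, unsigned idx)
	{
		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_URING_CMD;
		sqe->fd = dev_fd;
		sqe->cmd_op = FUSE_URING_REQ_FETCH;
		sqe->user_data = idx;
	}

	int fuse_uring_register(struct io_uring *ring, int dev_fd)
	{
		unsigned queued = 0;

		/* Queue 64 FETCH commands through a 4-entry SQ, in
		 * batches of at most 4: SQE slots are recycled at
		 * submit time, while the commands themselves stay
		 * pending in the kernel. */
		while (queued < FUSE_QUEUE_DEPTH) {
			struct io_uring_sqe *sqe;

			while (queued < FUSE_QUEUE_DEPTH &&
			       (sqe = io_uring_get_sqe(ring)) != NULL)
				fuse_uring_prep_fetch(sqe, dev_fd, queued++);

			if (io_uring_submit(ring) < 0)
				return -1;
		}
		return 0;
	}

	int fuse_uring_reap(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;
		unsigned head, seen = 0;

		/* wait_cq(1): block for a single CQE only, since
		 * further fuse requests might never arrive ... */
		if (io_uring_wait_cqe(ring, &cqe) < 0)
			return -1;

		/* ... then drain whatever else completed meanwhile. */
		io_uring_for_each_cqe(ring, head, cqe) {
			/* handle the fuse request in cqe->user_data */
			seen++;
		}
		io_uring_cq_advance(ring, seen);
		return (int)seen;
	}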
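
And a sketch of the single-ring point at the end: if the fuse uring
commands and the tcp I/O share one ring, a user_data tag is enough to
demultiplex completions in a single wait loop, with no eventfd hop
between threads. The tag scheme and the two handlers are made up for
illustration:

	#include <liburing.h>

	/* Illustrative tagging: the top user_data bit says which
	 * subsystem a CQE belongs to. */
	#define TAG_FUSE (0ULL << 63)
	#define TAG_TCP	 (1ULL << 63)
	#define TAG_MASK (1ULL << 63)

	extern void handle_fuse_cqe(struct io_uring_cqe *cqe);
	extern void handle_tcp_cqe(struct io_uring_cqe *cqe);

	void queue_tcp_recv(struct io_uring *ring, int sock_fd,
			    void *buf, unsigned len, unsigned conn_idx)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		if (!sqe)
			return;	/* SQ full: submit first in real code */
		io_uring_prep_recv(sqe, sock_fd, buf, len, 0);
		sqe->user_data = TAG_TCP | conn_idx;
	}

	void event_loop_once(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;
		unsigned head, seen = 0;

		/* One submit/wait serves both traffic types in a
		 * single batch. */
		io_uring_submit_and_wait(ring, 1);
		io_uring_for_each_cqe(ring, head, cqe) {
			if ((cqe->user_data & TAG_MASK) == TAG_TCP)
				handle_tcp_cqe(cqe);
			else
				handle_fuse_cqe(cqe);
			seen++;
		}
		io_uring_cq_advance(ring, seen);
	}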