Re: Large CQE for fuse headers

Pavel Begunkov <asml.silence@xxxxxxxxx> · Mon, 14 Oct 2024 18:48:13 +0100

On 10/14/24 16:21, Bernd Schubert wrote:
On 10/14/24 15:34, Pavel Begunkov wrote:
On 10/14/24 13:47, Bernd Schubert wrote:
On 10/14/24 13:10, Miklos Szeredi wrote:
On Mon, 14 Oct 2024 at 04:44, Ming Lei <tom.leiming@xxxxxxxxx> wrote:

It also depends on how the fuse user code consumes the big CQE payload. If
the fuse header needs to stay in memory a bit longer, you may have to copy it
somewhere for post-processing, since io_uring (kernel) needs the CQE to be
returned asap.

Yes.

I'm not quite sure how the libfuse interface will work to accommodate
this.  Currently if the server needs to delay the processing of a
request it would have to copy all arguments, since validity will not
be guaranteed after the callback returns.  With the io_uring
infrastructure the headers would need to be copied, but the data
buffer would be per-request and would not need copying.  This is
relaxing a requirement so existing servers would continue to work
fine, but would not be able to take full advantage of the multi-buffer
design.

Bernd do you have an idea how this would work?

I assume returning a CQE is io_uring_cq_advance()?

Yes

In my current libfuse io_uring branch that only happens when
all CQEs have been processed. We could also easily switch to
io_uring_cqe_seen() to do it per CQE.

Either one works.
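
To make the difference between the two concrete, here is a minimal sketch
that models only the CQ head/tail accounting. It is not liburing code:
struct model_cq, model_cq_free() and free_slots_midway() are hypothetical
names made up for illustration. The one assumption taken from liburing is
that io_uring_cqe_seen() moves the CQ head by one entry, while
io_uring_cq_advance(ring, nr) moves it by nr entries at once.

```c
#include <assert.h>

/* Hypothetical model of a completion queue; not a liburing structure. */
struct model_cq {
	unsigned head;    /* consumer index, advanced by userspace */
	unsigned tail;    /* producer index, advanced by the kernel */
	unsigned entries; /* CQ ring size */
};

/* Slots the producer can still fill without overflowing. */
static unsigned model_cq_free(const struct model_cq *cq)
{
	return cq->entries - (cq->tail - cq->head);
}

/*
 * Free slots seen by the producer after `processed` of the ready CQEs
 * have been handled by the application.
 *
 * per_cqe != 0: the head is moved after every CQE (io_uring_cqe_seen() style).
 * per_cqe == 0: the head is moved once, only after all ready CQEs are
 *               handled (a single io_uring_cq_advance() style call).
 */
unsigned free_slots_midway(struct model_cq cq, unsigned processed, int per_cqe)
{
	unsigned ready = cq.tail - cq.head;

	assert(processed <= ready);
	if (per_cqe || processed == ready)
		cq.head += processed;
	return model_cq_free(&cq);
}
```

With a full 8-entry CQ and 3 CQEs handled so far, the per-CQE variant has
already handed 3 slots back, while the batched variant hands back none until
all 8 are done. Whether that matters depends on how close the CQ runs to
full.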

I don't understand why we need to return CQEs asap, assuming CQ
ring size is the same as SQ ring size - why does it matter?

The SQE is consumed once the request is issued, but nothing
prevents the user from keeping the QD larger than the SQ size,
e.g. do M syscalls each ending N requests and then wait for

Typo: "sending" or queueing N requests. In other words, it's
perfectly legal to:

ring = create_ring(nr_cqes=N);
for (i = 0..M) {
	for (j = 0..N)
		prep_sqe();
	submit_all_sqes();
}
wait(nr=N * M);

With the caveat that a single wait can't complete more than the
CQ size, but you can even add a loop on top of the wait.

while (nr_inflight_cqes) {
	wait(nr = min(CQ_size, nr_inflight_cqes));
	process_cqes();
}
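
As a rough illustration of that loop, the sketch below counts how many wait
calls it takes to reap a given number of in-flight completions when each
wait can return at most one CQ's worth. It is a pure model with no io_uring
calls; reap_in_chunks() and min_u() are hypothetical helpers, and a real
loop would wait for, process and return the CQEs where the comment is.

```c
#include <assert.h>

static unsigned min_u(unsigned a, unsigned b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical model: returns the number of wait calls needed to reap
 * nr_inflight completions when one wait yields at most cq_size CQEs.
 */
unsigned reap_in_chunks(unsigned cq_size, unsigned nr_inflight)
{
	unsigned waits = 0;

	assert(cq_size > 0);
	while (nr_inflight) {
		unsigned batch = min_u(cq_size, nr_inflight);

		/* A real loop would wait for `batch` CQEs here, process
		 * them, and return them to the kernel before looping. */
		nr_inflight -= batch;
		waits++;
	}
	return waits;
}
```

For example, with a CQ of 8 entries and M * N = 32 requests in flight, the
loop runs 4 times; with 33 in flight it needs a fifth, shorter wait.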

Or do something more elaborate. Frameworks often allow pushing
any number of requests without caring too much about exactly
matching queue sizes, apart from sizing them for performance
reasons.

N * M completions.

I need a bit of help to understand this. Do you mean that in typical
io-uring usage SQEs get submitted, already released in kernel

Typical or not, the number of requests in flight is not
limited by the size of the SQ, it only limits how many
requests you can queue per syscall, i.e. per io_uring_submit().

and then users submit even more SQEs? And that creates a
kernel queue depth for completion?
I guess as long as libfuse does not expose the ring we don't have
that issue. But then yeah, exposing the ring to fuse-server/daemon
is planned...

Could be, for example you don't need to care about overflows
at all if the CQ size is always larger than the number of
requests in flight. Perhaps the simplest example:

prep_requests(nr=N);
wait_cq(nr=N);
process_cqes(nr=N);

If we indeed need to return the CQE before processing the request,
it indeed would be better to have a 2nd memory buffer associated with
the fuse request.

With that said, the usual problem is to size the CQ so that it
(almost) never overflows, otherwise it hurts performance. With
DEFER_TASKRUN you can delay returning CQEs to the kernel until
the next time you wait for completions, i.e. do io_uring waiting
syscall. Without the flag, CQEs may arrive asynchronously to the
user, so that case needs a bit more consideration.

Current libfuse code has IORING_SETUP_SINGLE_ISSUER,
IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and
IORING_SETUP_COOP_TASKRUN disabled, as these somehow slow
things down.

Those flags are not a requirement; you can try to size the
CQ so that overflows are rare. It's just a bit easier to do
with DEFER_TASKRUN.

Not sure if this thread is optimal to discuss this. I would
also first like to sort out all the other design topics before
going into fine-tuning...

--
Pavel Begunkov