On 10/14/24 15:34, Pavel Begunkov wrote:
> On 10/14/24 13:47, Bernd Schubert wrote:
>> On 10/14/24 13:10, Miklos Szeredi wrote:
>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <tom.leiming@xxxxxxxxx> wrote:
>>>
>>>> It also depends on how fuse user code consumes the big CQE payload, if
>>>> fuse header needs to keep in memory a bit long, you may have to copy it
>>>> somewhere for post-processing since io_uring(kernel) needs CQE to be
>>>> returned back asap.
>>>
>>> Yes.
>>>
>>> I'm not quite sure how the libfuse interface will work to accommodate
>>> this. Currently if the server needs to delay the processing of a
>>> request it would have to copy all arguments, since validity will not
>>> be guaranteed after the callback returns. With the io_uring
>>> infrastructure the headers would need to be copied, but the data
>>> buffer would be per-request and would not need copying. This is
>>> relaxing a requirement so existing servers would continue to work
>>> fine, but would not be able to take full advantage of the multi-buffer
>>> design.
>>>
>>> Bernd do you have an idea how this would work?
>>
>> I assume returning a CQE is io_uring_cq_advance()?
>
> Yes
>
>> In my current libfuse io_uring branch that only happens when
>> all CQEs have been processed. We could also easily switch to
>> io_uring_cqe_seen() to do it per CQE.
>
> Either that one.
>
>> I don't understand why we need to return CQEs asap, assuming CQ
>> ring size is the same as SQ ring size - why does it matter?
>
> The SQE is consumed once the request is issued, but nothing
> prevents the user to keep the QD larger than the SQ size,
> e.g. do M syscalls each ending N requests and then wait for
> N * M completions.
>

I need a bit of help to understand this. Do you mean that in typical
io_uring usage SQEs get submitted, are already released in the kernel,
and then users submit even more SQEs? And that this creates a kernel
queue depth for completions?
I guess as long as libfuse does not expose the ring we don't have
that issue.
But then yeah, exposing the ring to fuse-server/daemon is planned...

>> If we indeed need to return the CQE before processing the request,
>> it indeed would be better to have a 2nd memory buffer associated with
>> the fuse request.
>
> With that said, the usual problem is to size the CQ so that it
> (almost) never overflows, otherwise it hurts performance. With
> DEFER_TASKRUN you can delay returning CQEs to the kernel until
> the next time you wait for completions, i.e. do io_uring waiting
> syscall. Without the flag, CQEs may come asynchronously to the
> user, so need a bit more consideration.
>

Current libfuse code has IORING_SETUP_SINGLE_ISSUER,
IORING_SETUP_DEFER_TASKRUN, IORING_SETUP_TASKRUN_FLAG and
IORING_SETUP_COOP_TASKRUN disabled, as these somehow slow things down.
Not sure if this thread is the right place to discuss this. I would also
first like to sort out all the other design topics before going into
fine-tuning...

Thanks,
Bernd
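
A toy model of the point quoted above, that the queue depth can exceed
the SQ size because SQEs are consumed at submission, may help. It is only
a sketch under stated assumptions: toy_cq, toy_cq_post, toy_cq_advance and
toy_overflow_count are hypothetical names invented here, none of them are
liburing or kernel APIs, and the model does not attempt to describe how
the kernel actually handles a full CQ. It just counts completions that
would find no free CQ slot when an application submits `rounds` batches of
`batch` requests, returning CQEs either after every batch or not at all
until the end.

```c
#include <stddef.h>

/* Toy completion queue: NOT the kernel data structure, only a counter model. */
struct toy_cq {
    unsigned entries;    /* CQ ring size */
    unsigned pending;    /* CQEs posted but not yet returned by the user */
    unsigned overflowed; /* completions that found the ring full */
};

/* Model n requests completing: each one needs a free CQ slot. */
static void toy_cq_post(struct toy_cq *cq, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (cq->pending < cq->entries)
            cq->pending++;
        else
            cq->overflowed++;
    }
}

/* Model the user returning up to n CQEs (the role io_uring_cq_advance()
 * plays in the thread). Returns how many slots were actually freed. */
static unsigned toy_cq_advance(struct toy_cq *cq, unsigned n)
{
    if (n > cq->pending)
        n = cq->pending;
    cq->pending -= n;
    return n;
}

/* Submit `rounds` batches of `batch` requests that all complete at once.
 * If reap_each_round is nonzero, every pending CQE is returned after each
 * batch; otherwise nothing is returned until all batches have completed.
 * Returns the number of completions that had no free CQ slot. */
unsigned toy_overflow_count(unsigned cq_entries, unsigned batch,
                            unsigned rounds, int reap_each_round)
{
    struct toy_cq cq = { cq_entries, 0, 0 };

    for (unsigned r = 0; r < rounds; r++) {
        toy_cq_post(&cq, batch);
        if (reap_each_round)
            toy_cq_advance(&cq, cq.pending);
    }
    return cq.overflowed;
}
```

With an 8-entry CQ and 4 rounds of 8 requests, toy_overflow_count(8, 8, 4, 0)
reports 24 completions without a slot, while returning CQEs after each
round, toy_overflow_count(8, 8, 4, 1), reports 0. Under the model's
assumptions, that is the case for either sizing the CQ for N * M entries
or returning CQEs promptly.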