On 10/14/24 19:48, Pavel Begunkov wrote:
> On 10/14/24 16:21, Bernd Schubert wrote:
>> On 10/14/24 15:34, Pavel Begunkov wrote:
>>> On 10/14/24 13:47, Bernd Schubert wrote:
>>>> On 10/14/24 13:10, Miklos Szeredi wrote:
>>>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <tom.leiming@xxxxxxxxx> wrote:
>>>>>
>>>>>> It also depends on how the fuse user code consumes the big CQE
>>>>>> payload; if the fuse header needs to stay in memory a bit longer,
>>>>>> you may have to copy it somewhere for post-processing, since
>>>>>> io_uring (kernel) needs the CQE to be returned back asap.
>>>>>
>>>>> Yes.
>>>>>
>>>>> I'm not quite sure how the libfuse interface will work to accommodate
>>>>> this. Currently, if the server needs to delay the processing of a
>>>>> request, it has to copy all arguments, since their validity is not
>>>>> guaranteed after the callback returns. With the io_uring
>>>>> infrastructure the headers would need to be copied, but the data
>>>>> buffer would be per-request and would not need copying. This is
>>>>> relaxing a requirement, so existing servers would continue to work
>>>>> fine, but would not be able to take full advantage of the multi-buffer
>>>>> design.
>>>>>
>>>>> Bernd, do you have an idea how this would work?
>>>>
>>>> I assume returning a CQE is io_uring_cq_advance()?
>>>
>>> Yes.
>>>
>>>> In my current libfuse io_uring branch that only happens when
>>>> all CQEs have been processed. We could also easily switch to
>>>> io_uring_cqe_seen() to do it per CQE.
>>>
>>> Either one works.
>>>
>>>> I don't understand why we need to return CQEs asap; assuming the CQ
>>>> ring size is the same as the SQ ring size, why does it matter?
>>>
>>> The SQE is consumed once the request is issued, but nothing
>>> prevents the user from keeping the QD larger than the SQ size,
>>> e.g. do M syscalls each ending N requests and then wait for
>
> typo, sending or queueing N requests. In other words it's
> perfectly legal to:
>
> ring = create_ring(nr_cqes=N);
> for (i = 0 .. M) {
>         for (j = 0 .. N)
>                 prep_sqe();
>         submit_all_sqes();
> }
> wait(nr=N * M);
>
> With the caveat that a single wait can't complete more than the
> CQ size, but you can even add a loop on top of the wait:
>
> while (nr_inflight_cqes) {
>         wait(nr = min(CQ_size, nr_inflight_cqes));
>         process_cqes();
> }
>
> Or do something more elaborate; frameworks often allow pushing
> any number of requests without caring too much about exactly
> matching queue sizes, apart from sizing them for performance
> reasons.
>
>>> N * M completions.
>>
>> I need a bit of help to understand this. Do you mean that in typical
>> io_uring usage SQEs get submitted and already released in the kernel,
>
> Typical or not, the number of requests in flight is not
> limited by the size of the SQ; it only limits how many
> requests you can queue per syscall, i.e. per io_uring_submit().
>
>> and then users submit even more SQEs? And that creates a
>> kernel queue depth for completion?
>> I guess as long as libfuse does not expose the ring we don't have
>> that issue. But then yeah, exposing the ring to the fuse server/daemon
>> is planned...
>
> Could be. For example, you don't need to care about overflows
> at all if the CQ size is always larger than the number of
> requests in flight.
> Perhaps the simplest example:
>
> prep_requests(nr=N);
> wait_cq(nr=N);
> process_cqes(nr=N);

With only libfuse as the ring user it is more like:

prep_requests(nr=N);
wait_cq(1);   ==> we must not wait for more than 1, as more might never arrive
io_uring_for_each_cqe {
}

I still think there is no issue with libfuse (or any other fuse lib) as
the single ring user, but if the same ring then gets used for more, all
of that might come up.

@Miklos, maybe we avoid using large CQEs/SQEs and instead set up our own
separate buffer for FUSE headers?

Thanks,
Bernd
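
To make that last idea a bit more concrete, here is a rough sketch of how
the daemon-side completion loop could look if every in-flight request has
a daemon-owned entry holding the FUSE header and payload, so nothing needs
to be copied out of the CQE before it is handed back. This is only an
illustration of the idea, not the actual patchset interface: struct
fuse_ring_ent, queue_for_processing() and the sizes are made up.

#include <string.h>
#include <liburing.h>
#include <linux/fuse.h>

#define QUEUE_DEPTH 64                  /* illustrative only */
#define PAYLOAD_SZ  (1024 * 1024)       /* illustrative only */

/* one preallocated entry per in-flight request, owned by the daemon */
struct fuse_ring_ent {
        struct fuse_in_header in_hdr;   /* request header lives here, not in the CQE */
        char payload[PAYLOAD_SZ];       /* per-request data buffer */
};

static struct fuse_ring_ent ents[QUEUE_DEPTH];

/* hypothetical hook that hands the entry to the fuse request processing */
static void queue_for_processing(struct fuse_ring_ent *ent);

static void handle_completions(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        unsigned head, nr = 0;

        /* wait for a single completion only, more might never arrive */
        if (io_uring_wait_cqe(ring, &cqe) < 0)
                return;

        io_uring_for_each_cqe(ring, head, cqe) {
                /* user_data was set to the entry index at submit time */
                struct fuse_ring_ent *ent = &ents[cqe->user_data];

                /*
                 * Header and payload are in our own buffer, so there is
                 * nothing to copy out of the CQE before returning it.
                 */
                queue_for_processing(ent);
                nr++;
        }

        /* hand all seen CQEs back to the kernel in one go */
        io_uring_cq_advance(ring, nr);
}

With small CQEs the completion only has to identify the entry, and
io_uring_cq_advance() can run right after the loop, which would also
address Ming's point about returning CQEs asap.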