On 10/14/24 19:48, Pavel Begunkov wrote:
> On 10/14/24 16:21, Bernd Schubert wrote:
>> On 10/14/24 15:34, Pavel Begunkov wrote:
>>> On 10/14/24 13:47, Bernd Schubert wrote:
>>>> On 10/14/24 13:10, Miklos Szeredi wrote:
>>>>> On Mon, 14 Oct 2024 at 04:44, Ming Lei <tom.leiming@xxxxxxxxx> wrote:
>>>>>
>>>>>> It also depends on how the fuse user code consumes the big CQE
>>>>>> payload; if the fuse header needs to stay in memory a bit longer,
>>>>>> you may have to copy it somewhere for post-processing, since
>>>>>> io_uring (kernel) needs the CQE to be returned back asap.
>>>>>
>>>>> Yes.
>>>>>
>>>>> I'm not quite sure how the libfuse interface will work to accommodate
>>>>> this. Currently, if the server needs to delay the processing of a
>>>>> request, it has to copy all arguments, since their validity is not
>>>>> guaranteed after the callback returns. With the io_uring
>>>>> infrastructure the headers would need to be copied, but the data
>>>>> buffer would be per-request and would not need copying. This is
>>>>> relaxing a requirement, so existing servers would continue to work
>>>>> fine, but would not be able to take full advantage of the multi-buffer
>>>>> design.
>>>>>
>>>>> Bernd, do you have an idea how this would work?
>>>>
>>>> I assume returning a CQE is io_uring_cq_advance()?
>>>
>>> Yes.
>>>
>>>> In my current libfuse io_uring branch that only happens when
>>>> all CQEs have been processed. We could also easily switch to
>>>> io_uring_cqe_seen() to do it per CQE.
>>>
>>> Either one works.
>>>
>>>> I don't understand why we need to return CQEs asap; assuming the CQ
>>>> ring size is the same as the SQ ring size, why does it matter?
>>>
>>> The SQE is consumed once the request is issued, but nothing
>>> prevents the user from keeping the QD larger than the SQ size,
>>> e.g. do M syscalls each ending N requests and then wait for
>
> typo, sending or queueing N requests. In other words it's
> perfectly legal to:
>
> ring = create_ring(nr_cqes=N);
> for (i = 0 .. M) {
>         for (j = 0 .. N)
>                 prep_sqe();
>         submit_all_sqes();
> }
> wait(nr=N * M);
>
> With the caveat that a single wait can't complete more than the
> CQ size, but you can even add a loop on top of the wait:
>
> while (nr_inflight_cqes) {
>         wait(nr = min(CQ_size, nr_inflight_cqes));
>         process_cqes();
> }
>
> Or do something more elaborate; frameworks often allow pushing
> any number of requests without caring too much about exactly
> matching queue sizes, apart from sizing them for performance
> reasons.
>
>>> N * M completions.
>>
>> I need a bit of help to understand this. Do you mean that in typical
>> io_uring usage SQEs get submitted and already released in the kernel,
>
> Typical or not, the number of requests in flight is not
> limited by the size of the SQ; it only limits how many
> requests you can queue per syscall, i.e. per io_uring_submit().
>
>> and then users submit even more SQEs? And that creates a
>> kernel queue depth for completion?
>> I guess as long as libfuse does not expose the ring we don't have
>> that issue. But then yeah, exposing the ring to the fuse server/daemon
>> is planned...
>
> Could be. For example, you don't need to care about overflows
> at all if the CQ size is always larger than the number of
> requests in flight.
> Perhaps the simplest example:
>
> prep_requests(nr=N);
> wait_cq(nr=N);
> process_cqes(nr=N);

With only libfuse as the ring user it is more like:

prep_requests(nr=N);
wait_cq(1);   ==> we must not wait for more than 1, as more might never arrive
io_uring_for_each_cqe {
}

I still think there is no issue with libfuse (or any other fuse lib) as
the single ring user, but if the same ring then gets used for more, all
of that might come up.

@Miklos, maybe we avoid using large CQEs/SQEs and instead set up our own
separate buffer for FUSE headers?

Thanks,
Bernd
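
To make that last idea a bit more concrete, here is a rough sketch of how
the daemon-side completion loop could look if every in-flight request has
a daemon-owned entry holding the FUSE header and payload, so nothing needs
to be copied out of the CQE before it is handed back. This is only an
illustration of the idea, not the actual patchset interface: struct
fuse_ring_ent, queue_for_processing() and the sizes are made up.

#include <string.h>
#include <liburing.h>
#include <linux/fuse.h>

#define QUEUE_DEPTH 64                  /* illustrative only */
#define PAYLOAD_SZ  (1024 * 1024)       /* illustrative only */

/* one preallocated entry per in-flight request, owned by the daemon */
struct fuse_ring_ent {
        struct fuse_in_header in_hdr;   /* request header lives here, not in the CQE */
        char payload[PAYLOAD_SZ];       /* per-request data buffer */
};

static struct fuse_ring_ent ents[QUEUE_DEPTH];

/* hypothetical hook that hands the entry to the fuse request processing */
static void queue_for_processing(struct fuse_ring_ent *ent);

static void handle_completions(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        unsigned head, nr = 0;

        /* wait for a single completion only, more might never arrive */
        if (io_uring_wait_cqe(ring, &cqe) < 0)
                return;

        io_uring_for_each_cqe(ring, head, cqe) {
                /* user_data was set to the entry index at submit time */
                struct fuse_ring_ent *ent = &ents[cqe->user_data];

                /*
                 * Header and payload are in our own buffer, so there is
                 * nothing to copy out of the CQE before returning it.
                 */
                queue_for_processing(ent);
                nr++;
        }

        /* hand all seen CQEs back to the kernel in one go */
        io_uring_cq_advance(ring, nr);
}

With small CQEs the completion only has to identify the entry, and
io_uring_cq_advance() can run right after the loop, which would also
address Ming's point about returning CQEs asap.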