On 5/30/24 18:44, Shachar Sharon wrote:
> On Wed, May 29, 2024 at 10:36 PM Bernd Schubert <bschubert@xxxxxxx> wrote:
>>
>> Most of the performance improvement with fuse-over-io-uring for
>> synchronous requests comes from the possibility to run processing on
>> the submitting cpu core and to also wake the submitting process on the
>> same core - avoiding switching between cpu cores.
>>
>> Signed-off-by: Bernd Schubert <bschubert@xxxxxxx>
>> ---
>>  fs/fuse/dev.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index c7fd3849a105..851c5fa99946 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -333,7 +333,10 @@ void fuse_request_end(struct fuse_req *req)
>>  		spin_unlock(&fc->bg_lock);
>>  	} else {
>>  		/* Wake up waiter sleeping in request_wait_answer() */
>> -		wake_up(&req->waitq);
>> +		if (fuse_per_core_queue(fc))
>> +			__wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL);
>> +		else
>> +			wake_up(&req->waitq);
>
> Would it be possible to apply this idea for regular FUSE connection?

I probably should have written it in the commit message: without
io-uring, performance is the same or slightly worse. With direct-IO
reads:

  jobs   /dev/fuse       /dev/fuse
         (migrate off)   (migrate on)
   1      2023            1652
   2      3375            2805
   4      3823            4193
   8      7796            8161
  16      8520            8518
  24      8361            8084
  32      8717            8342

(in MB/s)

I think there is no improvement because the daemon threads process
requests on random cores, i.e. request processing doesn't happen on the
same core the request was submitted from.

> What would happen if some (buggy or malicious) userspace FUSE server uses
> sched_setaffinity(2) to run only on a subset of active CPUs?

The request goes to the ring; which cpu eventually handles it should not
matter for correctness, but performance will not be optimal then.
That being said, the introduction mail points out an issue with xfstest
generic/650, which disables/enables CPUs in a loop - I need to
investigate what happens there.

Thanks,
Bernd
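
[Editorial note: the following is an illustrative userspace sketch, not
part of the patch or this thread. It shows one way a FUSE server could
pin one worker thread per CPU via pthread_setaffinity_np(), which is the
kind of affinity control discussed above; the function names
(start_pinned_workers, worker_fn) are made up for this example.]

/*
 * Illustrative only: pin one hypothetical worker thread per online CPU
 * so that request processing stays on a fixed core.
 * Build with: cc -pthread example.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static void *worker_fn(void *arg)	/* stand-in for the request-processing loop */
{
	(void)arg;
	return NULL;
}

int start_pinned_workers(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (long cpu = 0; cpu < ncpus; cpu++) {
		pthread_t t;
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET((int)cpu, &set);

		if (pthread_create(&t, NULL, worker_fn, NULL))
			return -1;

		/* Restrict this worker to a single CPU. */
		if (pthread_setaffinity_np(t, sizeof(set), &set))
			return -1;
	}
	return 0;
}

Whether such pinning helps in practice still depends on the kernel waking
the waiting process on the same core, which is what the quoted hunk in
fuse_request_end() changes.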