On 4/28/23 03:44, Hillf Danton wrote:
> On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <bschubert@xxxxxxx>
>> Btw, a very hackish way to 'solve' the issue is this
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index cd7aa679c3ee..dd32effb5010 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>>  	int err;
>>  	int prev_cpu = task_cpu(current);
>>
>> +	/* When running over uring with core-affined userspace threads, we
>> +	 * do not want to let the request-submitting process migrate away.
>> +	 * The issue is that even after waking up on the right core,
>> +	 * processes that have submitted requests might get migrated away,
>> +	 * because the ring thread is still doing a bit of work or is in
>> +	 * the process of going to sleep. The assumption here is that
>> +	 * processes are started on the right core (i.e. idle cores) and
>> +	 * can then stay on that core when they come and do file system
>> +	 * requests.
>> +	 * An alternative would be to set SCHED_IDLE for ring threads,
>> +	 * but that would be a problem if other processes keep the cpu
>> +	 * busy. SCHED_IDLE or this hack here result in about a factor of
>> +	 * 3.5 for max meta request performance.
>> +	 *
>> +	 * Ideally we would tell the scheduler that ring threads are not
>> +	 * disturbing, so that migration away from them should happen
>> +	 * only very rarely.
>> +	 */
>> +	if (fc->ring.ready)
>> +		migrate_disable();
>> +
>>  	if (!fc->no_interrupt) {
>>  		/* Any signal may interrupt this */
>>  		err = wait_event_interruptible(req->waitq,
>>
> If I understand it correctly, the seesaw workload hint to scheduler looks
> like the diff below, leaving scheduler free to pull the two players apart
> across CPU and to migrate anyone.

Thanks a lot, Hillf! I had a day off / family day today, so the kernel is
only compiling now.

>
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
>  	/* acquire extra reference, since request is still needed
>  	   after fuse_request_end() */
>  	__fuse_get_request(req);
> +	current->seesaw = 1;
>  	queue_request_and_unlock(fiq, req);
>
>  	request_wait_answer(req);
> @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
>  			fc->max_write))
>  		return -EINVAL;
>
> +	current->seesaw = 1;

fuse_dev_do_read is for plain /dev/fuse (accessed with read/write), where
we do not know on which cores the IO threads are running, nor which of
them to wake up when an application comes with a request. There is a patch
that uses __wake_up_sync to wake the IO thread, and reports that it helps
performance, but I don't see that effect and I think Miklos doesn't
either. For direct-io reads I had also already tested disabling migration
- it did not show any effect - so we'd better not set current->seesaw = 1
in fuse_dev_do_read for now.

With my fuse-uring patches (https://lwn.net/Articles/926773/) things are
clearer: there is one IO thread per core, and the libfuse side binds each
of these threads to a single core (see the sketch below the table).

nproc   /dev/fuse    /dev/fuse    fuse-uring   fuse-uring
        migrate on   migrate off  migrate on   migrate off
    1         2023         1652         1151          3998
    2         3375         2805         2221          7950
    4         3823         4193         4540         15022
    8         7796         8161         7846         22591
   16         8520         8518        12235         27864
   24         8361         8084         9415         27864
   32         8361         8084         9124         12971
(in MiB/s)

So core affinity really matters, and with core affinity fuse-uring is
always faster than the existing code. For single-threaded metadata
workloads (file creates/stat/unlink) the migrate on/off difference is
rather similar. I am going to run with multiple processes over the next
days.
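Just to illustrate what the binding means, here is a minimal userspace
sketch (this is not the actual libfuse code; pin_to_core is a made-up
helper name):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling ring thread to a single core, so that requests
 * submitted on that core are always handled by the same thread.
 */
static int pin_to_core(int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

In libfuse each ring thread would call something like this with its own
core number right after it is created.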
For paged (async) IO it behaves a bit differently, as there uring can show
its strength and multiple requests can be combined during CQE processing -
it is better to choose an idle ring thread on another core. I actually
have a question about that as well - later.

>  restart:
>  	for (;;) {
>  		spin_lock(&fiq->lock);

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,7 @@ struct task_struct {
>  	/* delay due to memory thrashing */
>  	unsigned			in_thrashing:1;
>  #endif
> +	unsigned			seesaw:1;
>
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
>  	if (wake_flags & WF_TTWU) {
>  		record_wakee(p);
>
> +		if (p->seesaw && current->seesaw)
> +			return cpu;
>  		if (sched_energy_enabled()) {
>  			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>  			if (new_cpu >= 0)

Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests
whether cpu is in p->cpus_ptr? The combination of both patches results in

	if (p->seesaw && current->seesaw)
		return cpu;

	if ((wake_flags & WF_CURRENT_CPU) &&
	    cpumask_test_cpu(cpu, p->cpus_ptr))
		return cpu;

While writing this mail the kernel compile finished, but it got late -
I will test in the morning.

Thanks again,
Bernd
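P.S.: Just as an illustration, one (hypothetical, not my actual benchmark)
way to see whether the submitting process migrates across a request is to
compare sched_getcpu() before and after an operation on the fuse mount;
the path below is only a placeholder:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;
	/* placeholder path, assumed to be on the fuse mount */
	const char *path = "/mnt/fuse/testfile";
	int before, after;

	before = sched_getcpu();	/* core we submit the request from */
	if (stat(path, &st) != 0)
		perror("stat");
	after = sched_getcpu();		/* core we woke up on */

	printf("cpu before=%d after=%d%s\n", before, after,
	       before == after ? "" : " (migrated)");
	return 0;
}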