Ingo, Peter,

I would like to ask how to wake up a waiter on a waitq on the same core.
I have tried __wake_up_sync()/WF_SYNC, but I do not see any effect.

I'm currently working on fuse/uring communication patches; besides the
uring communication there is also a queue per core. Basic bonnie++
benchmarks with a zero file size, to just create/read(0)/delete, show a
~3x IOPS difference between CPU-bound and unbound bonnie++ - i.e. with
these patches it is _not_ the fuse daemon that needs to be bound, but
the application doing I/O to the file system.

We basically have

bonnie -> vfs                               (app/vfs)
fuse_req                                    (app/fuse.ko)
qid = task_cpu(current)                     (app/fuse.ko)
ring(qid) / SQE completion (fuse.ko)        (app/fuse.ko/uring)
wait_event(req->waitq, ...)                 (app/fuse.ko) [app wait]
daemon ring / handle CQE                    (daemon)
send-back result as SQE                     (daemon/uring)
fuse_request_end                            (daemon/uring/fuse.ko)
wake_up() ---> random core                  (daemon/uring/fuse.ko)
[app wakeup/fuse/vfs/syscall return]
bonnie ==> different core

1) bound

[root@imesrv1 ~]# numactl --localalloc --physcpubind=0 bonnie++ -q -x 1 -s0 -d /scratch/dest/ -n 20:1:1:20 -r 0 -u 0 | bon_csv2txt
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
      files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   20:1:1:20  6229  28 11289  41 12785  24  6615  28  7769  40 10020  25
Latency               411us     824us     816us     298us   10473us     200ms

2) not bound

[root@imesrv1 ~]# bonnie++ -q -x 1 -s0 -d /scratch/dest/ -n 20:1:1:20 -r 0 -u 0 | bon_csv2txt
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
      files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   20:1:1:20  2064  33  2923  43  4556  28  2061  33  2186  42  4245  30
Latency               850us    3914us    2496us     738us     758us    6469us

With fewer files the difference becomes a bit smaller, but it is still
clearly visible.
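To make the queue-selection step in the flow above concrete, here is a
minimal sketch of what "qid = task_cpu(current)" looks like on the
request submission path (fuse_ring/fuse_ring_queue and the function
name are made-up illustration names, not the actual identifiers from
the patch set):

/* Sketch only - struct and function names are hypothetical. */
static struct fuse_ring_queue *fuse_select_queue(struct fuse_ring *ring)
{
	/* Pick the per-core queue of the CPU the application is
	 * currently running on, so the daemon handling the request
	 * shares caches with the submitter. */
	unsigned int qid = task_cpu(current);

	return &ring->queues[qid];
}

The locality win from this only survives if the later wake_up() puts the
application back on that same core - which is exactly the part that does
not work yet.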
Besides cache line bouncing, I'm sure that CPU frequency and C-states
will matter - I could tune that in the lab, but in the end I want to
test what users actually run (I recently checked with a large HPC
center - Forschungszentrum Juelich - their HPC compute nodes are not
tuned up, in order to save energy).

Also, in order to really tune down latencies, I want to add a struct
file_operations::uring_cmd_iopoll thread, which will spin for a short
time and avoid most of the kernel/userspace communication. If
applications (with n-threads < n-cores) then get scheduled on different
cores, different rings will be used, resulting in
n-threads-spinning > n-threads-application.

There was already a related fuse thread about this before:
https://lore.kernel.org/lkml/1638780405-38026-1-git-send-email-quic_pragalla@xxxxxxxxxxx/

With the fuse-uring patches that part is basically solved - the waitq
that thread is about is not used anymore. But as per above, what
remains is the waitq for incoming work (not mentioned in the thread
above). As I wrote, I have tried __wake_up_sync((x), TASK_NORMAL), but
it does not make a difference for me - similar to Miklos' testing
before. I have also tried struct completion / swait - neither makes a
difference. I can see that task_struct has wake_cpu, but there doesn't
seem to be a good interface to set it.

Any ideas?

Thanks,
Bernd
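P.S.: For completeness, the two wakeup variants I compared on the
completion path look like this (context heavily simplified; req->waitq
is the fuse request waitq, everything else is illustrative):

/* Plain wakeup, as fuse_request_end() does today - the scheduler is
 * free to run the woken waiter on any core, typically not the one it
 * submitted from. */
wake_up(&req->waitq);

/* Sync wakeup (WF_SYNC) - hints that the waker is about to sleep so
 * the waiter may run in its place; in my testing this did not keep
 * the application on its submitting core either. */
__wake_up_sync(&req->waitq, TASK_NORMAL);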