I'm adding Jan Kara to the cc as he was looking into the workqueue implemention in depth this morning and helped me better understand what is going on. Phil and Ming are still cc'd as an independent test would still be nice. > <light bulb illumination> > > Is this actually ping-ponging the CIL flush and journal IO > completion because xlog_bio_end_io() always punts journal IO > completion to the log workqueue, which is: > > log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", > WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0, > mp->m_super->s_id); > > i.e. it uses per-cpu kthreads for processing journal IO completion > similar to DIO io completion and thereby provides a vector for > the same issue? > Your light bulb is on point. The XFS unbound workqueue does run near the task and does not directly cause the migration but the IO completions matter. As it turned out, it was the IO completions I was looking at in the old traces but there was insufficient detail to see the exact sequence. I only observed that a bound wq at the end was causing the migration and it was a consistent pattern. I did a more detailed trace that included workqueue tracepoints. I limited the run to 1 and 2 dbench clients and compared with and without the patch. I still did not dig deep into the specifics of how XFS interacts with workqueues because I'm focused on how the scheduler reacts. The patch is still having an impact with bound workqueues as expected because; # zgrep sched_migrate_task 5.5.0-rc7-tipsched-20200124/iter-0/ftrace-dbench4.gz | wc -l 556259 # zgrep sched_migrate_task 5.5.0-rc7-kworkerstack-v1r2/iter-0/ftrace-dbench4.gz | wc -l 11736 There are still migrations happening but there also was a lot of logging going on for this run so it's not directly comparable what I originally reported. This is an example sequence of what's happening from a scheduler perspective on the vanilla kernel. It's editted a bit because there were a lot of other IOs going on, mostly logging related which confuse the picture. dbench-25633 [004] d... 194.998648: workqueue_queue_work: work struct=000000001cccdc2d function=xlog_cil_push_work [xfs] workqueue=00000000d90239c9 req_cpu=512 cpu=4294967295 dbench-25633 [004] d... 194.998650: sched_waking: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 dbench-25633 [004] d... 194.998655: sched_wakeup: comm=kworker/u161:6 pid=718 prio=120 target_cpu=006 kworker/u161:6-718 [006] .... 194.998692: workqueue_execute_start: work struct 000000001cccdc2d: function xlog_cil_push_work [xfs] kworker/u161:6-718 [006] .... 194.998706: workqueue_execute_end: work struct 000000001cccdc2d Dbench is on CPU 4, it queues xlog_cil_push_work on an UNBOUND workqueue. An unbound kworker wakes on CPU 6 and finishes quickly. kworker/u161:6-718 [006] .... 194.998707: workqueue_execute_start: work struct 0000000046fbf8d5: function wq_barrier_func kworker/u161:6-718 [006] d... 194.998708: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 kworker/u161:6-718 [006] d... 194.998712: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=004 kworker/u161:6-718 [006] .... 194.998713: workqueue_execute_end: work struct 0000000046fbf8d5 The kworker wakes dbench and finding that CPU 4 is still free, dbench uses its previous CPU and no migration occurs. dbench-25633 [004] d... 194.998727: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 dbench-25633 [004] d... 194.998728: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 dbench-25633 [004] dN.. 194.998731: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 Dbench queues blk_mq_requeue_work. This is a BOUND workqueue with a mandatory CPU target of 4 so no migration.. kworker/4:1H-991 [004] .... 194.998736: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work kworker/4:1H-991 [004] .... 194.998742: workqueue_execute_end: work struct 00000000442434a7 blk_mq_requeue_work is done <idle>-0 [004] d.s. 194.998859: workqueue_queue_work: work struct=00000000442434a7 function=blk_mq_requeue_work workqueue=00000000df918933 req_cpu=512 cpu=4 <idle>-0 [004] d.s. 194.998861: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 <idle>-0 [004] dNs. 194.998862: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 kworker/4:1H-991 [004] .... 194.998866: workqueue_execute_start: work struct 00000000442434a7: function blk_mq_requeue_work kworker/4:1H-991 [004] .... 194.998870: workqueue_execute_end: work struct 00000000442434a7 <idle>-0 [004] d.s. 194.998911: workqueue_queue_work: work struct=0000000072f39adb function=xlog_ioend_work [xfs] workqueue=00000000008f3d7f req_cpu=512 cpu=4 <idle>-0 [004] d.s. 194.998912: sched_waking: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 <idle>-0 [004] dNs. 194.998913: sched_wakeup: comm=kworker/4:1H pid=991 prio=100 target_cpu=004 Ok, this looks like an IRQ delivered for IO completion and the xlog_ioend_work is reached. The BOUND kworker is woken again by the IRQ handler on CPU 4 because it has no choice. kworker/4:1H-991 [004] .... 194.998918: workqueue_execute_start: work struct 0000000072f39adb: function xlog_ioend_work [xfs] kworker/4:1H-991 [004] d... 194.998943: sched_waking: comm=dbench pid=25633 prio=120 target_cpu=004 kworker/4:1H-991 [004] d... 194.998945: sched_migrate_task: comm=dbench pid=25633 prio=120 orig_cpu=4 dest_cpu=5 kworker/4:1H-991 [004] d... 194.998947: sched_wakeup: comm=dbench pid=25633 prio=120 target_cpu=005 kworker/4:1H-991 [004] .... 194.998948: workqueue_execute_end: work struct 0000000072f39adb The IO completion handler finishes, the bound workqueue tries to wake dbench on its old CPU. The BOUND kworker is on CPU 4, the task wants CPU 4 but the CPU is busy with the kworker so the scheduler function select_idle_sibling picks CPU 5 instead and now the task is migrated and we have started our round-trip of all CPUs sharing a LLC. It's not a perfect round-robin because p->recent_used_cpu often works. Looking at the traces, dbench bounces back and forth between CPUs 4 and 5 for 7 IO completions before bouncing between CPUs 6/7 and so on. The patch alters the very last stage. The IO completion is a bound kworker and allows the wakee task to use the same CPU and avoid the migration. -- Mel Gorman SUSE Labs