On Mon, Jan 27, 2020 at 02:36:08PM +0000, Mel Gorman wrote: > Commit 8ab39f11d974 ("xfs: prevent CIL push holdoff in log > recovery") changed from using bound workqueues to using unbound > workqueues. Functionally this makes sense but it was observed at the time > that the dbench performance dropped quite a lot and CPU migrations were > excessively high even when there are plenty of idle CPUs. Hmmm - that just made the CIL workqueue WQ_UNBOUND. Not a complex change... > The pattern of the task migration is straight-forward. With XFS, an IO > issuer may delegate work to a kworker which wakes on the same CPU. On > completion of the work, it wakes the task, finds that the previous CPU > is busy (because the kworker is still running on it) and migrates the > task to the next idle CPU. The task ends up migrating around all CPUs > sharing a LLC at high frequency. This has negative implications both in > commication costs and power management. mpstat confirmed that at low > thread counts that all CPUs sharing an LLC has low level of activity. Very interesting, Mel. :) I suspect this appears is a very similar issue that is discussed in this thread about workqueues and AIO completion latencies: https://lore.kernel.org/lkml/20191114113153.GB4213@ming.t460p/ The problem is described here, along with comments about how fundamental this behaviour is to the correct functioning of filesystems: https://lore.kernel.org/lkml/20191121221051.GG4614@xxxxxxxxxxxxxxxxxxx/ There are several patches thrown about during the discussion, initially focussed on wakeup pre-emption to run the work immediately until I pointed out that was the wrong thing to do for work being deferred to workqueues. After a some more proposed patches the discussion on the scheduler side of things largely ground to a halt and so has not been fixed. So I'm initially wondering if this solves that problem, too, or whether you are seeing a slightly different manifestation of that same scheduler issue.... > The impact of this problem is related to the number of CPUs sharing an LLC. > > This patch special cases the pattern and allows a kworker waker and a > task wakee to stack on the same CPU if there is a strong chance they are > directly related. The expectation is that the kworker is likely going > back to sleep shortly. This is not guaranteed as the IO could be queued > asynchronously but there is a very strong relationship between the task and > kworker in this case that would justify stacking on the same CPU instead > of migrating. There should be few concerns about kworker starvation given > that the special casing is only when the kworker is the waker. > DBench on XFS [snip positive dbench results] Yeah, dbench does lots of synchronous operations that end up waiting on journal flushes (data integrity operations) so it would trip over kworker scheduling latency issues. FWIW, I didn't see any perf degradation on my machines from the commit you quoted, but I also had a real hard time replication the aio completion latency problem on them as well. Hence I don't think they are particularly susceptible to bad migration decisions, so I'm not surprised I didn't see this. > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index fe4e0d775375..76df439aff76 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5912,6 +5912,19 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) > (available_idle_cpu(prev) || sched_idle_cpu(prev))) > return prev; > > + /* > + * Allow a per-cpu kthread to stack with the wakee if the > + * kworker thread and the tasks previous CPU are the same. > + * The assumption is that the wakee queued work for the > + * per-cpu kthread that is now complete and the wakeup is > + * essentially a sync wakeup. > + */ > + if (is_per_cpu_kthread(current) && > + prev == smp_processor_id() && > + this_rq()->nr_running <= 1) { > + return prev; > + } Ok, so if I've read this correctly, this special case only triggers when scheduling from the per-cpu kworker thread context, and only if there is one other runnable task on the queue? So it special cases the ping-pong case so that the non-bound task that scheduled the kworker also remains scheduled this CPU? Hmmmm. When we set up a workqueue as WQ_UNBOUND on a numa system, isn't the worker pool set up as a node-bound task? i.e. it's not a per-cpu kthread anymore, but a task limited by the cpumask of that node? That isn't a per-CPU kthread anymore, is it? That is: > @@ -2479,3 +2479,16 @@ static inline void membarrier_switch_mm(struct rq *rq, > { > } > #endif > + > +#ifdef CONFIG_SMP > +static inline bool is_per_cpu_kthread(struct task_struct *p) > +{ > + if (!(p->flags & PF_KTHREAD)) > + return false; > + > + if (p->nr_cpus_allowed != 1) > + return false; p->nr_cpus_allowed is no longer 1 but the number of CPUs in the per-node cpu mask it is allowed to run on? And so if that is the case, then the change in commit 8ab39f11d974 which set WQ_UNBOUND on the XFS CIL workqueue would mean the above logic change should not be triggering for the CIL worker because it is no longer a CPU bound kthread.... What am I missing here? <light bulb illumination> Is this actually ping-ponging the CIL flush and journal IO completion because xlog_bio_end_io() always punts journal IO completion to the log workqueue, which is: log->l_ioend_workqueue = alloc_workqueue("xfs-log/%s", WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0, mp->m_super->s_id); i.e. it uses per-cpu kthreads for processing journal IO completion similar to DIO io completion and thereby provides a vector for the same issue? A typical fsync is processed like this: user task CIL kworker IO completion kworker xfs_trans_commit() pushes on log <blocks waiting on flush completion> <wake> formats vectors loop { wait for iclog space <block waiting on write completion> <wake> journal IO completion frees up iclog space wakes write waiter <done> <wake> write vectors into iclog submit iclog IO } <done> <wake> journal IO completion frees up iclog space wakes flush waiter <done> <wakes> <transaction commit done> i.e. before commit 8ab39f11d974 we have: user task = unbound XFS CIL commit worker = CPU bound kthread XFS journal IO completion = CPU bound kthread And because the the CIL kworker and IO completion kworker are bound to the same CPU they don't trigger migrations as they can't be moved anywhere else. And so it doesn't matter how many times we switch between them to complete a CIL flush because they will not trigger migrations. However, after commit 8ab39f11d974 we have: user task = unbound XFS CIL commit worker = unbound kthread XFS journal IO completion = CPU bound kthread IOWs, I think we now have exactly the same situation as discussed in the thread I pointed you to above, where an unbound task work (the CIL kworker) is trying to run on the same CPU as the CPU bound IO completion kworker, and that causes the CIL kworker to be migrated to a different CPU on each bounced throught the "wait for iclog space" loop. Hence your new logic is actually triggering on the journal IO completion kworker threads, not the CIL kworker threads. After all this, I have two questions that would help me understand if this is what you are seeing: 1. to confirm: does removing just the WQ_UNBOUND from the CIL push workqueue (as added in 8ab39f11d974) make the regression go away? 2. when the problem shows up, which tasks are actually being migrated excessively - is it the user task, the CIL kworker task or something else? Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx