On Wed, Dec 04, 2019 at 06:29:03PM +0800, Hillf Danton wrote:
> 
> On Wed, 4 Dec 2019 09:29:25 +1100 Dave Chinner wrote:
> > 
> > If we really want to hack around the load balancer problems in this
> > way, then we need to add a new workqueue concurrency management type
> > with behaviour that lies between the default of bound and WQ_UNBOUND.
> > 
> > WQ_UNBOUND limits scheduling to within a numa node - see
> > wq_update_unbound_numa() for how it sets up the cpumask attributes
> > it applies to its workers - but we need the work to be bound to
> > within the local cache domain rather than a numa node. IOWs, set up
> > the kworker task pool management structure with the right attributes
> > (e.g. cpu masks) to define the cache domains, add all the hotplug
> > code to make it work with CPU hotplug, then simply apply those
> > attributes to the kworker task that is selected to execute the work.
> > 
> > This allows the scheduler to migrate the kworker away from the local
> > run queue without interrupting the currently scheduled task. The
> > cpumask the task is configured with limits the scheduler to
> > selecting the best CPU within the local cache domain, and we don't
> > have to bind work to CPUs to get CPU cache friendly work scheduling.
> > This also avoids the overhead of per-queue_work_on() sibling CPU
> > calculation, and all that code wanting to use this functionality
> > needs to do is add a single flag at work queue init time (e.g.
> > WQ_CACHEBOUND).
> > 
> > IOWs, if the task migration behaviour cannot be easily fixed and so
> > we need work queue users to be more flexible about work placement,
> > then the solution needed here is "cpu cache local work queue
> > scheduling" implemented in the work queue infrastructure, not in
> > every workqueue user.
> 
> Add WQ_CACHE_BOUND and a user of it and a helper to find cpus that
> share cache.

<sigh>

If you are going to quote my suggestion in full, then please
implement it in full.
This patch does almost none of what you quoted above - it still has
all the problems of the previous version that led me to write the
above.

> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1358,16 +1358,42 @@ static bool is_chained_work(struct workq
>  	return worker && worker->current_pwq->wq == wq;
>  }
>  
> +static DEFINE_PER_CPU(int, wq_sel_cbc_cnt);
> +static DEFINE_PER_CPU(int, wq_sel_cbc_cpu);
> +#define WQ_SEL_CBC_BATCH	7
> +
> +static int wq_select_cache_bound_cpu(int this_cpu)
> +{
> +	int *cntp, *cpup;
> +	int cpu;
> +
> +	cntp = get_cpu_ptr(&wq_sel_cbc_cnt);
> +	cpup = this_cpu_ptr(&wq_sel_cbc_cpu);
> +	cpu = *cpup;
> +
> +	if (!(*cntp & WQ_SEL_CBC_BATCH)) {
> +		cpu = cpus_share_cache_next_cpu(this_cpu, cpu);
> +		*cpup = cpu;
> +	}
> +	(*cntp)++;
> +	put_cpu_ptr(&wq_sel_cbc_cnt);
> +
> +	return cpu;
> +}

This selects a specific CPU in the local cache domain at
queue_work() time, just like the previous patch. It does not do what
I suggested above in response to the scalability issues this
approach has...

>  /*
>   * When queueing an unbound work item to a wq, prefer local CPU if allowed
>   * by wq_unbound_cpumask. Otherwise, round robin among the allowed ones to
>   * avoid perturbing sensitive tasks.
>   */
> -static int wq_select_unbound_cpu(int cpu)
> +static int wq_select_unbound_cpu(int cpu, bool cache_bound)
>  {
>  	static bool printed_dbg_warning;
>  	int new_cpu;
>  
> +	if (cache_bound)
> +		return wq_select_cache_bound_cpu(cpu);
> +
>  	if (likely(!wq_debug_force_rr_cpu)) {
>  		if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
>  			return cpu;
> @@ -1417,7 +1443,8 @@ static void __queue_work(int cpu, struct
>  	rcu_read_lock();
>  retry:
>  	if (req_cpu == WORK_CPU_UNBOUND)
> -		cpu = wq_select_unbound_cpu(raw_smp_processor_id());
> +		cpu = wq_select_unbound_cpu(raw_smp_processor_id(),
> +					    wq->flags & WQ_CACHE_BOUND);

And the per-cpu kworker pool selection after we've selected a CPU
here binds it to that specific CPU or (if WQ_UNBOUND) the local
node.
IOWs, this is exactly the same functionality as the last patch, just
moved inside the work queue infrastructure. Apart from the
WQ_CACHE_BOUND flag, this patch doesn't implement any of what I
suggested above.

It does not solve the problem of bound kworkers kicking running
tasks off a CPU so the bound task can run, and it does not allow the
scheduler to select the best CPU in the local cache scheduling
domain for the kworker to run on. i.e. it still behaves like bound
work rather than WQ_UNBOUND work.

IMO, this adds the CPU selection to the -wrong function-. We still
want the local CPU selected when req_cpu == WORK_CPU_UNBOUND. The
following code selects where the kworker will be bound based on the
task pool that the workqueue is configured to use:

	/* pwq which will be used unless @work is executing elsewhere */
	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));

i.e. the local CPU is the key we need to look up the task pool for
running tasks in the local cache domain - we do not use it as the
CPU we want to run work on. IOWs, what we really want is this:

	if (wq->flags & WQ_UNBOUND)
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
	else if (wq->flags & WQ_CACHE_BOUND)
		pwq = unbound_pwq_by_cache(wq, cpu);
	else
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);

And then unbound_pwq_by_cache() is implemented in a similar manner
to unbound_pwq_by_node(), where there is a separate worker pool per
cache domain. The scheduler domain attributes (cpumask) are held by
the task pool, and they are applied to the kworker task when it is
given the work to run. This, like WQ_UNBOUND, allows the scheduler
to select the best CPU in the cpumask for the task to run on.

Binding kworkers to a single CPU is exactly the problem we need to
avoid here - we need to let the scheduler choose the best CPU in the
local cache domain based on the current load.
That means the implementation needs to behave like a WQ_UNBOUND
workqueue, just with a more restrictive cpumask.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx