On Wed, Apr 26, 2023 at 11:31:00AM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 26, 2023 at 02:17:03PM -0700, Paul E. McKenney wrote:
> > But the idea here is to spread the load of queueing the work as well as
> > spreading the load of invoking the callbacks.
> >
> > I suppose that I could allocate an array of ints, gather the online CPUs
> > into that array, and do a power-of-two distribution across that array.
> > But RCU Tasks allows CPUs to go offline with queued callbacks, so this
> > array would also need to include those CPUs as well as the ones that
> > are online.
>
> Ah, I see, so it needs to make the distinction between cpus which have
> never been online and are currently offline but used to be online.

But only for as long as the used-to-be-online CPUs have callbacks for
the corresponding flavor of Tasks RCU.  :-/

> > Given that the common-case system has a dense cpus_online_mask, I opted
> > to keep it simple, which is optimal in the common case.
> >
> > Or am I missing a trick here?
>
> The worry is that on systems with actual CPU hotplugging, cpu_online_mask
> can be pretty sparse - e.g. 1/4 filled wouldn't be too out there. In such
> cases, the current code would end up scheduling the work items on the
> issuing CPU (which is what WORK_CPU_UNBOUND does) 3/4 of the time, which
> probably isn't the desired behavior.
>
> So, I can initialize all per-cpu workqueues for all possible cpus on boot
> so that rcu doesn't have to worry about it, but that would still have a
> similar problem of the callbacks not really being spread as intended.

Unless you get a few more users that care about this, it is probably
best to just let RCU deal with it.

For whatever it is worth, I am working on a smaller patch that does not
need to do cpus_read_lock(), but anyone with short-term needs should
stick with the existing patch.

> I think it depends on how important it is to spread the callback workload
> evenly.  If that matters quite a bit, it probably would make sense to
> maintain a cpumask for has-ever-been-online CPUs. Otherwise, do you think
> it can just use an unbound workqueue and forget about manually
> distributing the workload?

If there are not very many callbacks, then you are right that spreading
the load makes no sense.  And the 18-months-ago version of this code in
fact did not bother spreading.  But new workloads came up that cared
about update-side performance and scalability, which led to the current
code.

This code initially invokes all the callbacks directly, just as it did
unconditionally 18 months ago, because ->percpu_dequeue_lim is
initialized to 1.  This causes all the RCU Tasks callbacks to be queued
on CPU 0 and to be invoked directly by the grace-period kthread.

But if the call_rcu_tasks_*() code detects too much lock contention on
CPU 0's queue, which indicates that very large numbers of callbacks are
being queued, it switches to per-CPU mode.  In that case, we are likely
to have lots of callbacks on lots of queues, and then we really do want
to invoke them concurrently.

Then if a later grace period finds that there are no more callbacks, it
switches back to CPU-0 mode.

So this extra workqueue overhead should happen only on systems with
sparse cpu_online_masks that are under heavy call_rcu_tasks_*() load.

That is the theory, anyway!  ;-)

							Thanx, Paul