On 8/26/21 12:13 PM, Jens Axboe wrote:
> On 8/26/21 12:09 PM, Bart Van Assche wrote:
>> On 8/26/21 7:40 AM, Zhen Lei wrote:
>>> lock protection needs to be added only in dd_finish_request(), which
>>> is unlikely to cause significant performance side effects.
>>
>> Not sure the above is correct. Every new atomic instruction has a
>> measurable performance overhead. But I guess in this case that
>> overhead is smaller than the time needed to sum 128 per-CPU variables.
>
> percpu counters only really work if the summing is not in a hot path,
> or if the summing is just some "not zero" thing instead of a full sum.
> They just don't scale at all for even moderately sized systems.

Ugh, it's actually even worse in this case, since you do:

static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
}

which ends up iterating possible CPUs _twice_!

Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
to deadline, of which 37% of the overhead is from dd_dispatch(). With
the posted patch applied, it runs at 2.3M IOPS with mq-deadline, which
is a lot better. This is on my 3970X test box, so 32 cores, 64 threads.

Bart, either we fix this up ASAP and get rid of the percpu counters in
the hot path, or we revert this patch.

-- 
Jens Axboe
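
To make the "iterating possible CPUs twice" point concrete, here is a rough user-space sketch of the pattern. It is not the kernel's mq-deadline code: the `model_*` names, `NR_CPUS`, `NR_PRIO` and the `walks` field are all made up for illustration, and plain arrays stand in for real percpu storage. It only shows that subtracting two separate sums costs two full passes over the per-CPU slots, while folding both fields in one loop costs one.

```c
#include <stdint.h>

#define NR_CPUS 64 /* assumption: mirrors the 64-thread test box */
#define NR_PRIO 3  /* assumption: number of priority levels in this model */

/* Hypothetical per-CPU statistics slot; not the kernel's structure. */
struct model_stats {
	uint32_t inserted[NR_PRIO];
	uint32_t completed[NR_PRIO];
};

struct model_data {
	struct model_stats percpu[NR_CPUS]; /* one slot per possible CPU */
	uint32_t walks; /* instrumentation: full passes over all CPUs */
};

/* Sum one field over every CPU slot: one full pass. */
static uint32_t model_sum_inserted(struct model_data *dd, int prio)
{
	uint32_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += dd->percpu[cpu].inserted[prio];
	dd->walks++;
	return sum;
}

static uint32_t model_sum_completed(struct model_data *dd, int prio)
{
	uint32_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += dd->percpu[cpu].completed[prio];
	dd->walks++;
	return sum;
}

/* Same shape as the quoted dd_queued(): two sums, so two passes. */
static uint32_t model_queued_two_pass(struct model_data *dd, int prio)
{
	return model_sum_inserted(dd, prio) - model_sum_completed(dd, prio);
}

/*
 * Fold both fields in a single loop: one pass. A request may be inserted
 * on one CPU and completed on another, so a per-slot difference can wrap;
 * unsigned arithmetic modulo 2^32 still yields the correct total.
 */
static uint32_t model_queued_one_pass(struct model_data *dd, int prio)
{
	uint32_t queued = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		queued += dd->percpu[cpu].inserted[prio] -
			  dd->percpu[cpu].completed[prio];
	dd->walks++;
	return queued;
}
```

Note that the one-pass variant only halves the walk; it still touches every CPU slot on each call, so it would not address the scaling problem described above. Getting the summing out of the hot path entirely is a different change.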