Hi Tejun, all,

in our work on reducing bfq overhead, we bumped into an unexpected fact: the blkg_*stats_* functions, invoked in bfq to update cgroups statistics as in cfq, take about 40% of the total execution time of bfq. This causes an additional serious slowdown on any multicore CPU, as most of the bfq functions from which blkg_*stats_* get invoked are protected by a per-device scheduler lock. To give you an idea, on an Intel i7-4850HQ, with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), throughput grows from 260 to 404 KIOPS if the update of group stats is removed. This and all the other results we might share in this thread can be reproduced very easily with a (useful) script made by Luca Miccio [1].

We tried to understand the reason for this high overhead and, in particular, to find out whether there was some issue that we could address on our own. But the causes seem rather substantial: one of the most time-consuming operations needed by some blkg_*stats_* functions is, e.g., find_next_bit, for which we don't see any trivial replacement.

So, as a first attempt to reduce this severe slowdown, we have made a patch that moves the invocation of the blkg_*stats_* functions outside the critical sections protected by the bfq lock. Still, these functions apparently need to be protected by the request_queue lock, because the group they are invoked on may otherwise disappear before or while these functions are executed. Fortunately, tests run without even this lock have shown that the serialization caused by this lock has little impact (about a 5% throughput reduction).

As for results, moving these functions outside the bfq lock does improve throughput: it grows, e.g., from 260 to 316 KIOPS in the above test case. But we are still rather far from the optimum.

Do you have any clue about possible solutions to reduce the overhead of these functions?
If no relatively quick solution is available, we are planning to prepare, in addition to the above patch to increase parallelism, a further patch that lets the user disable stats updates, so as to gain a full throughput boost of up to 55% (according to the tests we have run so far on a few different systems).

Thanks,
Paolo

[1] https://github.com/Algodev-github/IOSpeed
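
For readers who want the restructuring described above in concrete form, here is a rough user-space sketch of the idea. It is not the actual patch and it is not kernel code: pthread mutexes stand in for the bfq scheduler lock and the request_queue lock, and every name in it (sched_data, group_stats, stats_delta, dispatch_request, stats_apply) is hypothetical. The scheduler critical section only records what has to be accounted, and the stats update itself runs after the scheduler lock is dropped, under the second lock only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for per-group statistics (not the blkg structures). */
struct group_stats {
	uint64_t serviced;       /* number of requests accounted */
	uint64_t service_bytes;  /* number of bytes accounted */
};

struct sched_data {
	pthread_mutex_t sched_lock;  /* plays the role of the per-device bfq lock */
	pthread_mutex_t queue_lock;  /* plays the role of the request_queue lock */
	uint64_t dispatched;         /* scheduler state, guarded by sched_lock */
	struct group_stats stats;    /* group stats, guarded by queue_lock */
};

/* Update recorded inside the scheduler critical section, applied later. */
struct stats_delta {
	uint64_t serviced;
	uint64_t service_bytes;
};

void sched_init(struct sched_data *sd)
{
	int ret;

	ret = pthread_mutex_init(&sd->sched_lock, NULL);
	assert(ret == 0);
	ret = pthread_mutex_init(&sd->queue_lock, NULL);
	assert(ret == 0);
	sd->dispatched = 0;
	sd->stats.serviced = 0;
	sd->stats.service_bytes = 0;
}

/* The (potentially expensive) stats update: serialized by queue_lock only. */
static void stats_apply(struct sched_data *sd, const struct stats_delta *d)
{
	pthread_mutex_lock(&sd->queue_lock);
	sd->stats.serviced += d->serviced;
	sd->stats.service_bytes += d->service_bytes;
	pthread_mutex_unlock(&sd->queue_lock);
}

void dispatch_request(struct sched_data *sd, uint64_t bytes)
{
	struct stats_delta delta = { 0, 0 };

	pthread_mutex_lock(&sd->sched_lock);
	sd->dispatched++;             /* the scheduling work proper */
	delta.serviced = 1;           /* only note what must be accounted */
	delta.service_bytes = bytes;
	pthread_mutex_unlock(&sd->sched_lock);

	/* Stats are updated outside the scheduler critical section. */
	stats_apply(sd, &delta);
}
```

In this sketch, threads calling dispatch_request() still serialize on queue_lock for the stats update, but they no longer hold sched_lock while doing so, which is the effect the patch aims for.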