Adding again Ulf, Linus and all the others, because Tejun replied to
my initial email, which did not include them yet as recipients.

> On 18 Oct 2017, at 17:02, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
>>
>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@xxxxxxxxxx> wrote:
>>
>> Hello, Paolo.
>>
>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>> ...
>>> protected by a per-device scheduler lock. To give you an idea, on an
>>> Intel i7-4850HQ, with 8 threads doing random I/O in parallel on
>>> null_blk (configured with 0 latency), if the update of group stats is
>>> removed, then the throughput grows from 260 to 404 KIOPS. This and
>>> all the other results we might share in this thread can be reproduced
>>> very easily with a (useful) script made by Luca Miccio [1].
>>
>> I don't think the old request_queue is ever built for multiple CPUs
>> hitting on a mem-backed device.
>>
>>> We tried to understand the reason for this high overhead and, in
>>> particular, to find out whether there was some issue that we could
>>> address on our own. But the causes seem somehow substantial: one of
>>> the most time-consuming operations needed by some blkg_*stat_*
>>> functions is, e.g., find_next_bit, for which we don't see any trivial
>>> replacement.
>>
>> Can you point to the specific ones? I can't find find_next_bit usages
>> in generic blkg code.
>>
>
> Yes, sorry for being too generic in the first place (for fear of
> writing too much).
>
> I have attached a flame graph (made by Luca), showing all the involved
> functions. Look, e.g., for the blkg_*stat_* functions invoked
> indirectly by bfq_dispatch_request, inside any of the worker
> processes. As I already wrote, find_next_bit seems to be only part of
> the cost of these functions (although an important part).
>
> You can obtain/reproduce the information in the flame graph (on an
> 8-logical-core CPU) by invoking
>
> perf record -g -a --call-graph dwarf -F 999
>
> and, in parallel,
>
> sudo ./IO_sched-speedtest.sh 20 8 bfq randread
>
> where IO_sched-speedtest.sh is the script I mentioned in my previous
> email [1].
>
> [1] https://github.com/Algodev-github/IOSpeed
>
>>> So, as a first attempt to reduce this severe slowdown, we have made a
>>> patch that moves the invocation of the blkg_*stat_* functions outside
>>> the critical sections protected by the bfq lock. Still, these
>>> functions apparently need to be protected with the request_queue
>>> lock, because
>>
>> blkgs are already protected with RCU, so RCU protection should be
>> enough.
>>
>
> blkgs are, but the blkg_stat objects passed to the blkg_*stat_*
> functions by bfq are not. In particular, these objects are contained
> in bfq_group objects. Anyway, as I wrote, the cost of using the
> request_queue lock seems to be a loss of about 5% of the throughput.
> So, I guess that replacing this lock with RCU protection would
> probably reduce this loss to only 2% or 3%. I wonder whether such a
> gain would be worth the additional conceptual complexity of RCU, at
> least until the major problem, i.e., the apparently high cost of stat
> updates, is solved somehow.
>
> Thanks,
> Paolo
>
>> Thanks.
>>
>> --
>> tejun
>
> <bfq-tracing-cgroup.svg>
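
P.S. To make the idea behind the patch mentioned above a bit more
concrete, here is a rough sketch of the restructuring of the dispatch
path that we have in mind. It is only illustrative, not the actual
patch: in particular, bfq_update_dispatch_stats() is a hypothetical
helper that would collect all the blkg_*stat_* invocations currently
executed inside the critical section.

static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
	struct request *rq;

	spin_lock_irq(&bfqd->lock);
	/*
	 * All scheduling decisions are taken here; no blkg_*stat_*
	 * function is invoked while bfqd->lock is held.
	 */
	rq = __bfq_dispatch_request(hctx);
	spin_unlock_irq(&bfqd->lock);

	/*
	 * Stats are updated only now, outside the scheduler lock. For
	 * the moment this still takes the request_queue lock, because
	 * the blkg_stat objects embedded in the bfq_group are not
	 * themselves RCU-protected.
	 */
	spin_lock_irq(hctx->queue->queue_lock);
	bfq_update_dispatch_stats(hctx->queue, rq); /* hypothetical helper */
	spin_unlock_irq(hctx->queue->queue_lock);

	return rq;
}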