Re: high overhead of functions blkg_*stats_* in bfq

Adding Ulf, Linus and the others again, because Tejun replied to my initial email, which did not yet include them as recipients.

> On 18 Oct 2017, at 17:02, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
> 
>> 
>> On 18 Oct 2017, at 15:19, Tejun Heo <tj@xxxxxxxxxx> wrote:
>> 
>> Hello, Paolo.
>> 
>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>> ...
>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>> null_blk (configured with 0 latency), if the update of group stats is
>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>> all the other results we might share in this thread can be reproduced
>>> very easily with a (useful) script made by Luca Miccio [1].
>> 
>> I don't think the old request_queue was ever built for multiple CPUs
>> hitting a mem-backed device.
>> 
>>> We tried to understand the reason for this high overhead, and, in
>>> particular, to find out whether there was some issue that we
>>> could address on our own.  But the causes seem to be substantial:
>>> one of the most time-consuming operations needed by some blkg_*stats_*
>>> functions is, e.g., find_next_bit, for which we don't see any trivial
>>> replacement.
>> 
>> Can you point to the specific ones?  I can't find find_next_bit usages
>> in generic blkg code.
>> 
> 
> Yes, sorry for being too generic in the first place (for fear of
> writing too much).
> 
> I have attached a flame graph (made by Luca), showing all involved
> functions.  Look, for example, at the blkg_*stat_* functions invoked
> indirectly by bfq_dispatch_request, inside any of the worker
> processes.  As I already wrote, find_next_bit seems to be only part of
> the cost of these functions (although an important part).
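> 
> Just to give a concrete (and admittedly speculative) idea of where
> find_next_bit can come from, here is a simplified sketch (not the
> actual blk-cgroup code) of summing a per-CPU counter: each step of
> for_each_online_cpu() goes through cpumask_next(), which is
> implemented on top of find_next_bit().
> 
> /* Simplified illustration only: summing a per-CPU counter. */
> static u64 sketch_percpu_sum(u64 __percpu *cnt)
> {
> 	u64 sum = 0;
> 	int cpu;
> 
> 	/*
> 	 * for_each_online_cpu() advances through the online cpumask
> 	 * with cpumask_next(), which calls find_next_bit() on the mask.
> 	 */
> 	for_each_online_cpu(cpu)
> 		sum += *per_cpu_ptr(cnt, cpu);
> 	return sum;
> }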
> 
> You can obtain/reproduce the information in the flame graph (on an
> 8-logical-core CPU) by invoking
> 
> perf record -g -a --call-graph dwarf -F 999
> 
> and, in parallel,
> 
> sudo ./IO_sched-speedtest.sh 20 8 bfq randread
> 
> where IO_sched-speedtest.sh is the script I mentioned in my previous
> email [1].
> 
> [1] https://github.com/Algodev-github/IOSpeed
> 
>>> So, as a first attempt to reduce this severe slowdown, we have made a
>>> patch that moves the invocation of blkg_*stats_* functions outside the
>>> critical sections protected by the bfq lock.  Still, these functions
>>> apparently need to be protected with the request_queue lock, because
>> 
>> blkgs are already protected with RCU, so RCU protection should be
>> enough.
>> 
> 
> blkgs are, but the blkg_stat objects passed to the blkg_*stat_*
> functions by bfq are not.  In particular, these objects are contained
> in bfq_group objects.  Anyway, as I wrote, the cost of using the
> request_queue lock seems to be a loss of about 5% of the throughput.
> So, I guess that replacing this lock with RCU protection would
> probably reduce this loss to only 2% or 3%.  I wonder whether such a
> gain would be worth the additional conceptual complexity of RCU; at
> least until the major problem, i.e., the apparently high cost of stat
> updates, is solved somehow.
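> 
> Just to make the two options concrete, here is a minimal sketch of the
> dispatch-side pattern (bfqg_update_stats() is a hypothetical
> placeholder for the blkg_*stat_* calls bfq actually performs):
> 
> /*
>  * What our patch does now: update the stats outside the bfq scheduler
>  * lock, but under the request_queue lock (the ~5% loss mentioned above).
>  */
> spin_lock_irq(q->queue_lock);
> bfqg_update_stats(bfqg);	/* hypothetical helper */
> spin_unlock_irq(q->queue_lock);
> 
> /*
>  * What relying on RCU would look like: the blkg lifetime is covered by
>  * the read-side critical section, but the blkg_stat objects embedded
>  * in bfq_group would still need their own protection.
>  */
> rcu_read_lock();
> bfqg_update_stats(bfqg);	/* hypothetical helper */
> rcu_read_unlock();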
> 
> Thanks,
> Paolo
> 
>> Thanks.
>> 
>> -- 
>> tejun
> 
> <bfq-tracing-cgroup.svg>




