Re: [PATCH 0/8 v3] bfq: Limit number of allocated scheduler tags per cgroup

> Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@xxxxxxxxxx> ha scritto:
> 
> 
> 
>> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@xxxxxxx> ha scritto:
>> 
>> Hello!
>> 
>> Here is the third revision of my patches to fix how bfq weights apply to cgroup
>> throughput and to the throughput of processes with different IO priorities. Since
>> v2 I've added some more patches so that IO priorities now also result in
>> service differentiation (previously, on some workloads, they had no effect on
>> service differentiation, similarly to cgroup weights). The last patch in the
>> series still needs some work, as in its current state it causes a notable
>> regression (~20-30%) with the dbench benchmark for large numbers of clients.
>> I've verified that the last patch is indeed necessary for the service
>> differentiation with the workload described in its changelog. As we discussed
>> with Paolo, I have also found out that if I remove the "waker has enough
>> budget" condition from bfq_select_queue(), dbench performance is restored
>> and the service differentiation is still good. But we probably need some
>> justification, or a cleaner solution than just removing the condition, so that
>> is still up for discussion. The first seven patches already noticeably improve
>> the situation for lots of workloads, so IMO they stand on their own and
>> can be merged regardless of how we go about the last patch.
>> 
> 
> Hi Jan,
> I have just one more (easy-to-resolve) doubt: you seem to have tested
> these patches mostly on the throughput side.  Did you run a
> startup-latency test as well?  I can run some for you, if you prefer.
> Just give me a few days.
> 

We are finally testing your patches a bit right now, checking for
regressions with our typical benchmarks ...

Paolo

> When that is cleared, your first seven patches are OK for me.
> Actually, I think they are a very good and relevant contribution.
> Patch number eight probably deserves some more analysis, as you pointed
> out.  As I already told you in our call, we can look at that budget
> condition together, and figure out the best tests to use to check
> that I/O control is not lost too much.
> 
> Thanks,
> Paolo
> 
>> Changes since v2:
>> * Rebased on top of current Linus' tree
>> * Updated computation of scheduler tag proportions to work correctly even
>> for processes within the same cgroup but with different IO priorities
>> * Added comment roughly explaining why we limit tag depth
>> * Added patch limiting waker / wakee detection in time so as to avoid at least
>> the most obvious false positives
>> * Added patch to log waker / wakee detections in blktrace for better debugging
>> * Added patch to properly account injected IO
>> 
>> Changes since v1:
>> * Fixed computation of appropriate proportion of scheduler tags for a cgroup
>> to work with deeper cgroup hierarchies.
>> 
>> Original cover letter:
>> 
>> I was looking into why cgroup weights do not have any measurable impact on
>> writeback throughput from different cgroups. This is actually a regression from
>> CFQ, where things work more or less OK and weights have roughly the impact they
>> should. The problem can be reproduced e.g. by running the following simple fio
>> job in two cgroups with different weights:
>> 
>> [writer]
>> directory=/mnt/repro/
>> numjobs=1
>> rw=write
>> size=8g
>> time_based
>> runtime=30
>> ramp_time=10
>> blocksize=1m
>> direct=0
>> ioengine=sync
>> 
>> I can observe there's no significant difference in the amount of data written
>> from different cgroups even though their weights are in, say, a 1:3 ratio.
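
Just to make the setup concrete, here is a minimal sketch of the cgroup side
of this reproducer (not taken from the patches). It assumes cgroup v2 mounted
at /sys/fs/cgroup, the io controller enabled in the parent's
cgroup.subtree_control, and bfq as the active scheduler, so io.bfq.weight
exists; the group names and the exact weight values are only illustrative:

/* repro_cgroups.c: create two cgroups with BFQ weights in a 1:3 ratio. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* Create the two groups; ignore errors if they already exist. */
        mkdir("/sys/fs/cgroup/light", 0755);
        mkdir("/sys/fs/cgroup/heavy", 0755);

        /* Default BFQ weights in the 1:3 ratio used above. */
        write_file("/sys/fs/cgroup/light/io.bfq.weight", "100\n");
        write_file("/sys/fs/cgroup/heavy/io.bfq.weight", "300\n");

        return 0;
}

The fio job above is then started once in each group, e.g. by writing the
fio PID into the group's cgroup.procs before the run.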
>> 
>> After some debugging I've understood the dynamics of the system. There are two
>> issues:
>> 
>> 1) The number of scheduler tags needs to be significantly larger than the
>> number of device tags. Otherwise there are not enough requests waiting in BFQ
>> to be dispatched to the device and thus there's nothing to schedule on.
>> 
>> 2) Even with enough scheduler tags, writers from two cgroups eventually start
>> contending on scheduler tag allocation. These are served on a first-come,
>> first-served basis, so writers from both cgroups feed requests into bfq at
>> approximately the same speed. Since bfq prefers IO from the heavier cgroup, that
>> IO is submitted and completed faster, and eventually we end up in a situation
>> where there's no IO from the heavier cgroup in bfq and all scheduler tags are
>> consumed by requests from the lighter cgroup. At that point bfq just dispatches
>> lots of IO from the lighter cgroup, since there's no contender for disk
>> throughput. As a result, the observed throughput for both cgroups is the same.
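
For point 1 above, a quick way to see how much room the scheduler actually has
is to compare nr_requests (the scheduler tag depth) with the device queue
depth. A minimal sketch, assuming a SCSI-like disk named sda (queue_depth may
not exist for other device types):

/* tagdepth.c: compare scheduler tag depth with device tag depth. */
#include <stdio.h>

static long read_long(const char *path)
{
        FILE *f = fopen(path, "r");
        long val = -1;

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        long sched_tags = read_long("/sys/block/sda/queue/nr_requests");
        long dev_tags = read_long("/sys/block/sda/device/queue_depth");

        printf("scheduler tags: %ld, device tags: %ld\n", sched_tags, dev_tags);
        if (sched_tags > 0 && dev_tags > 0 && sched_tags <= dev_tags)
                printf("little or nothing left queued in BFQ to schedule on\n");
        return 0;
}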
>> 
>> This series fixes the problem by accounting how many scheduler tags are
>> allocated to each cgroup; if a cgroup has more tags allocated than its
>> fair share (based on weights) in its service tree, we heavily limit the
>> scheduler tag bitmap depth for it so that it is not able to starve other
>> cgroups of scheduler tags.
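
To make the mechanism easier to picture, here is a toy sketch of the
depth-limiting idea (not the actual patch code): scale the allowed tag depth
by the cgroup's weight fraction within its service tree and clamp to that
share once the group already holds more tags than the share allows. All names
and numbers below are made up for illustration:

/* depth_limit.c: toy model of weight-proportional scheduler tag depth. */
#include <stdio.h>

static unsigned int limited_depth(unsigned int total_depth,
                                  unsigned int cgroup_weight,
                                  unsigned int tree_weight,
                                  unsigned int tags_allocated)
{
        /* Fair share of the scheduler tags, at least one tag. */
        unsigned int share = total_depth * cgroup_weight / tree_weight;

        if (share == 0)
                share = 1;

        /* Below the fair share: no restriction on further allocations. */
        if (tags_allocated < share)
                return total_depth;

        /* At or above the fair share: restrict the usable depth. */
        return share;
}

int main(void)
{
        /* 256 scheduler tags, two groups with weights 100 and 300. */
        printf("light group over its share  -> depth %u\n",
               limited_depth(256, 100, 400, 80));
        printf("heavy group under its share -> depth %u\n",
               limited_depth(256, 300, 400, 50));
        return 0;
}

The lighter group, once it is over its share, can then only grab a small part
of the tag space, while the heavier group keeps the full depth until it
exceeds its own (larger) share.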
>> 
>> What do people think about this?
>> 
>> 								Honza
>> 
>> Previous versions:
>> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@xxxxxxx # v1
>> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@xxxxxxx # v2




