> On 6 Oct 2021, at 19:31, Jan Kara <jack@xxxxxxx> wrote:
>
> Hello!
>
> Here is the third revision of my patches to fix how bfq weights apply to
> cgroup throughput and to the throughput of processes with different IO
> priorities. Since v2 I've added some more patches so that now IO priorities
> also result in service differentiation (previously they had no effect on
> service differentiation for some workloads, similarly to cgroup weights).
> The last patch in the series still needs some work, as in its current state
> it causes a notable regression (~20-30%) with the dbench benchmark for large
> numbers of clients. I've verified that the last patch is indeed necessary
> for the service differentiation with the workload described in its
> changelog. As we discussed with Paolo, I have also found out that if I
> remove the "waker has enough budget" condition from bfq_select_queue(),
> dbench performance is restored and the service differentiation is still
> good. But we probably need some justification, or a cleaner solution than
> just removing the condition, so that is still up for discussion. The first
> seven patches already noticeably improve the situation for lots of
> workloads, though, so IMO they stand on their own and can be merged
> regardless of how we go about the last patch.
>

Hi Jan,
I have just one more (easy-to-resolve) doubt: you seem to have tested these
patches mostly on the throughput side. Did you run a startup-latency test as
well? I can run some for you, if you prefer. Just give me a few days.

When that is cleared, your first seven patches are ok with me. Actually, I
think they are a very good and relevant contribution.

Patch number eight probably deserves some more analysis, as you pointed out.
As I already told you in our call, we can look at that budget condition
together, and figure out the best tests to use to check that I/O control
does not get lost too much.

Thanks,
Paolo

> Changes since v2:
> * Rebased on top of current Linus' tree
> * Updated computation of scheduler tag proportions to work correctly even
>   for processes within the same cgroup but with different IO priorities
> * Added a comment roughly explaining why we limit tag depth
> * Added a patch limiting waker / wakee detection in time to avoid at least
>   the most obvious false positives
> * Added a patch to log waker / wakee detections in blktrace for better
>   debugging
> * Added a patch to properly account injected IO
>
> Changes since v1:
> * Fixed computation of the appropriate proportion of scheduler tags for a
>   cgroup to work with deeper cgroup hierarchies.
>
> Original cover letter:
>
> I was looking into why cgroup weights do not have any measurable impact on
> writeback throughput from different cgroups. This is actually a regression
> from CFQ, where things work more or less OK and weights have roughly the
> impact they should. The problem can be reproduced e.g. by running the
> following simple fio job in two cgroups with different weights:
>
> [writer]
> directory=/mnt/repro/
> numjobs=1
> rw=write
> size=8g
> time_based
> runtime=30
> ramp_time=10
> blocksize=1m
> direct=0
> ioengine=sync
>
> I can observe that there's no significant difference in the amount of data
> written from the two cgroups despite their weights being in, say, a 1:3
> ratio.
>
> After some debugging I've understood the dynamics of the system. There are
> two issues:
>
> 1) The amount of scheduler tags needs to be significantly larger than the
> amount of device tags. Otherwise there are not enough requests waiting in
> BFQ to be dispatched to the device and thus there's nothing to schedule on.
>
> 2) Even with enough scheduler tags, writers from the two cgroups eventually
> start contending on scheduler tag allocation. Tags are handed out on a
> first-come, first-served basis, so writers from both cgroups feed requests
> into bfq at approximately the same speed. Since bfq prefers IO from the
> heavier cgroup, that IO is submitted and completed faster, and eventually
> we end up in a situation where there's no IO from the heavier cgroup in bfq
> and all scheduler tags are consumed by requests from the lighter cgroup. At
> that point bfq just dispatches lots of IO from the lighter cgroup, since
> there's no contender for disk throughput. As a result, the observed
> throughput for both cgroups is the same.
>
> This series fixes the problem by accounting how many scheduler tags are
> allocated to each cgroup; if a cgroup has more tags allocated than its fair
> share (based on weights) in its service tree, we heavily limit the
> scheduler tag bitmap depth for it so that it is not able to starve other
> cgroups of scheduler tags.
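To make the fair-share idea above concrete, the depth limiting amounts to
roughly the following. This is only an illustrative, self-contained sketch
with made-up names (cgroup_info, allowed_depth, and the example numbers),
not the code from the patches:

/*
 * Illustrative sketch of weight-proportional tag depth limiting.
 * All names and constants are invented for this example; this is not
 * the patch code.
 */
#include <stdio.h>

struct cgroup_info {
	const char *name;
	unsigned int weight;		/* bfq weight of the cgroup */
	unsigned int tags_in_use;	/* scheduler tags currently allocated */
};

/*
 * A cgroup whose allocated scheduler tags exceed its weight-proportional
 * share of the scheduler tag space gets its allocation depth clamped, so
 * that it cannot starve the other cgroups of tags.
 */
static unsigned int allowed_depth(const struct cgroup_info *cg,
				  unsigned int total_weight,
				  unsigned int total_tags,
				  unsigned int full_depth)
{
	unsigned int fair_share = total_tags * cg->weight / total_weight;

	if (cg->tags_in_use > fair_share)
		return full_depth / 4;	/* heavily limited (illustrative value) */
	return full_depth;		/* within fair share: no extra limit */
}

int main(void)
{
	struct cgroup_info light = { "light", 100, 200 };
	struct cgroup_info heavy = { "heavy", 300, 40 };
	unsigned int total_weight = light.weight + heavy.weight;

	printf("light gets depth %u\n",
	       allowed_depth(&light, total_weight, 256, 256));
	printf("heavy gets depth %u\n",
	       allowed_depth(&heavy, total_weight, 256, 256));
	return 0;
}

The point is only that once a cgroup exceeds its weight-proportional share
of the scheduler tags, further allocations for it are throttled, which keeps
tags available for the other cgroups.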
> What do people think about this?
>
> 								Honza
>
> Previous versions:
> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@xxxxxxx # v1
> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@xxxxxxx # v2
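As a footnote on the budget condition mentioned near the top of this thread,
the check in question has roughly the following shape. This is a heavily
simplified sketch with invented names (queue_state, may_inject_from_waker),
not the actual bfq_select_queue() code:

/*
 * Heavily simplified sketch of the "waker has enough budget" check
 * discussed above.  All names are invented for illustration; this is
 * not the actual bfq_select_queue() code.
 */
#include <stdbool.h>
#include <stdio.h>

struct queue_state {
	bool has_pending_io;
	unsigned int budget_left;	/* service budget still available */
	unsigned int next_rq_size;	/* size of the next queued request */
	struct queue_state *waker;	/* queue identified as this one's waker */
};

/*
 * While the in-service queue has no pending IO, requests from its waker
 * queue may be injected.  The contested part is the extra requirement
 * that the waker still has enough budget for its next request; removing
 * that requirement is what restores dbench throughput in the tests
 * mentioned above.
 */
static bool may_inject_from_waker(const struct queue_state *in_service)
{
	const struct queue_state *waker = in_service->waker;

	if (in_service->has_pending_io || !waker || !waker->has_pending_io)
		return false;

	return waker->budget_left >= waker->next_rq_size; /* the contested check */
}

int main(void)
{
	struct queue_state waker = { true, 4096, 8192, NULL };
	struct queue_state in_service = { false, 0, 0, &waker };

	printf("inject from waker: %s\n",
	       may_inject_from_waker(&in_service) ? "yes" : "no");
	return 0;
}

As Jan notes, dropping that final budget check restores dbench throughput
while keeping the service differentiation, so the open question is whether
the check is really needed or can be replaced with something weaker.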