> Il giorno 7 ott 2021, alle ore 18:33, Paolo Valente <paolo.valente@xxxxxxxxxx> ha scritto:
> 
> 
> 
>> Il giorno 6 ott 2021, alle ore 19:31, Jan Kara <jack@xxxxxxx> ha scritto:
>> 
>> Hello!
>> 
>> Here is the third revision of my patches to fix how bfq weights apply to cgroup
>> throughput and to the throughput of processes with different IO priorities. Since
>> v2 I've added some more patches so that now IO priorities also result in
>> service differentiation (previously, on some workloads, they had no effect on
>> service differentiation, similarly to cgroup weights). The last patch
>> in the series still needs some work, as in its current state it causes a
>> notable regression (~20-30%) with the dbench benchmark for large numbers of
>> clients. I've verified that the last patch is indeed necessary for the service
>> differentiation with the workload described in its changelog. As I discussed
>> with Paolo, I have also found out that if I remove the "waker has enough
>> budget" condition from bfq_select_queue(), dbench performance is restored
>> and the service differentiation is still good. But we probably need some
>> justification or a cleaner solution rather than just removing the condition, so
>> that is still open for discussion. The first seven patches already noticeably
>> improve the situation for lots of workloads, though, so IMO they stand on their
>> own and can be merged regardless of how we go about the last patch.
>> 
> 
> Hi Jan,
> I have just one more (easy-to-resolve) doubt: you seem to have tested
> these patches mostly on the throughput side. Did you run a
> startup-latency test as well? I can run some for you, if you prefer.
> Just give me a few days.
> 

We are finally testing your patches a little bit right now, for
regressions with our typical benchmarks ...

Paolo

> When that is cleared, your first seven patches are ok for me.
> Actually I think they are a very good and relevant contribution.
> Patch number eight probably deserves some more analysis, as you pointed
> out. As I already told you in our call, we can look at that budget
> condition together, and figure out the best tests to use to check
> that I/O control does not get lost too much.
> 
> Thanks,
> Paolo
> 
>> Changes since v2:
>> * Rebased on top of current Linus' tree
>> * Updated computation of scheduler tag proportions to work correctly even
>>   for processes within the same cgroup but with different IO priorities
>> * Added comment roughly explaining why we limit tag depth
>> * Added patch limiting waker / wakee detection in time to avoid at least the
>>   most obvious false positives
>> * Added patch to log waker / wakee detections in blktrace for better debugging
>> * Added patch to properly account injected IO
>> 
>> Changes since v1:
>> * Fixed computation of the appropriate proportion of scheduler tags for a cgroup
>>   to work with deeper cgroup hierarchies.
>> 
>> Original cover letter:
>> 
>> I was looking into why cgroup weights do not have any measurable impact on
>> writeback throughput from different cgroups. This is actually a regression from
>> CFQ, where things work more or less OK and weights have roughly the impact they
>> should. The problem can be reproduced, e.g., by running the following
>> simple fio job in two cgroups with different weights:
>> 
>> [writer]
>> directory=/mnt/repro/
>> numjobs=1
>> rw=write
>> size=8g
>> time_based
>> runtime=30
>> ramp_time=10
>> blocksize=1m
>> direct=0
>> ioengine=sync
>> 
>> I can observe that there's no significant difference in the amount of data
>> written from the different cgroups even though their weights are in, say, a
>> 1:3 ratio.
>> 
>> After some debugging I've understood the dynamics of the system. There are two
>> issues:
>> 
>> 1) The number of scheduler tags needs to be significantly larger than the
>> number of device tags. Otherwise there are not enough requests waiting in BFQ
>> to be dispatched to the device and thus there's nothing to schedule on.
>> 
>> 2) Even with enough scheduler tags, writers from the two cgroups eventually
>> start contending on scheduler tag allocation. Tags are handed out on a
>> first-come, first-served basis, so writers from both cgroups feed requests into
>> bfq at approximately the same speed. Since bfq prefers IO from the heavier
>> cgroup, that IO is submitted and completed faster, and eventually we end up in
>> a situation where there's no IO from the heavier cgroup in bfq and all
>> scheduler tags are consumed by requests from the lighter cgroup. At that point
>> bfq just dispatches lots of the IO from the lighter cgroup since there's no
>> contender for disk throughput. As a result, the observed throughput for both
>> cgroups is the same.
>> 
>> This series fixes the problem by accounting how many scheduler tags are
>> allocated to each cgroup, and if a cgroup has more tags allocated than its
>> fair share (based on weights) in its service tree, we heavily limit the
>> scheduler tag bitmap depth for it so that it is not able to starve other
>> cgroups of scheduler tags.
>> 
>> What do people think about this?
>> 
>> Honza
>> 
>> Previous versions:
>> Link: http://lore.kernel.org/r/20210712171146.12231-1-jack@xxxxxxx # v1
>> Link: http://lore.kernel.org/r/20210715132047.20874-1-jack@xxxxxxx # v2
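[Editor's note: to make the tag-limiting idea from the cover letter concrete, below is a
minimal, self-contained C sketch of the mechanism as described above: track how many
scheduler tags each cgroup currently holds and clamp the tag bitmap depth it may use once
it exceeds its weight-based fair share. This is not code from the patch series; the names
(fake_group, fair_share, allowed_depth) and the throttling factor are hypothetical and
chosen only for illustration.]

/*
 * Illustrative sketch only -- not the code from this patch series.
 * It models the idea from the cover letter: account the scheduler
 * tags each cgroup currently holds and, once a cgroup exceeds its
 * weight-based fair share, heavily limit the tag depth it may
 * allocate from so it cannot starve other cgroups of tags.
 */
#include <stdio.h>

struct fake_group {
	const char *name;
	unsigned int weight;		/* cgroup IO weight */
	unsigned int tags_in_use;	/* scheduler tags held right now */
};

/* Fair share of the scheduler tag space for one group, based on weights. */
static unsigned int fair_share(const struct fake_group *g,
			       unsigned int total_weight,
			       unsigned int total_tags)
{
	return total_tags * g->weight / total_weight;
}

/*
 * Depth a group is allowed to allocate from: the full depth while it is
 * below its fair share, a heavily reduced depth once it exceeds it.
 */
static unsigned int allowed_depth(const struct fake_group *g,
				  unsigned int total_weight,
				  unsigned int total_tags)
{
	unsigned int share = fair_share(g, total_weight, total_tags);

	if (g->tags_in_use <= share)
		return total_tags;		/* unrestricted */
	return share / 4 ? share / 4 : 1;	/* throttle hard */
}

int main(void)
{
	struct fake_group groups[] = {
		{ "light", 100, 220 },	/* already over its share */
		{ "heavy", 300,  20 },	/* well under its share */
	};
	unsigned int total_tags = 256, total_weight = 400;

	for (unsigned int i = 0; i < sizeof(groups) / sizeof(groups[0]); i++)
		printf("%s: share=%u allowed_depth=%u\n",
		       groups[i].name,
		       fair_share(&groups[i], total_weight, total_tags),
		       allowed_depth(&groups[i], total_weight, total_tags));
	return 0;
}

[With the sample numbers above, the lighter group already holds more tags than its share
of 64 and is clamped to a depth of 16, while the heavier group keeps the full depth of
256, so its requests can still reach bfq and be scheduled according to the weights.]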