Tejun Heo <tj@xxxxxxxxxx> writes:

> This patchset implements IO cost model based work-conserving
> proportional controller.
>
> While io.latency provides the capability to comprehensively prioritize
> and protect IOs depending on the cgroups, its protection is binary -
> the lowest latency target cgroup which is suffering is protected at
> the cost of all others. In many use cases including stacking multiple
> workload containers in a single system, it's necessary to distribute
> IO capacity with better granularity.
>
> One challenge of controlling IO resources is the lack of trivially
> observable cost metric. The most common metrics - bandwidth and iops
> - can be off by orders of magnitude depending on the device type and
> IO pattern. However, the cost isn't a complete mystery. Given
> several key attributes, we can make fairly reliable predictions on how
> expensive a given stream of IOs would be, at least compared to other
> IO patterns.
>
> The function which determines the cost of a given IO is the IO cost
> model for the device. This controller distributes IO capacity based
> on the costs estimated by such model. The more accurate the cost
> model the better but the controller adapts based on IO completion
> latency and as long as the relative costs across different IO
> patterns are consistent and sensible, it'll adapt to the actual
> performance of the device.
>
> Currently, the only implemented cost model is a simple linear one with
> a few sets of default parameters for different classes of device.
> This covers most common devices reasonably well. All the
> infrastructure to tune and add different cost models is already in
> place and a later patch will also allow using bpf progs for cost
> models.
>
> Please see the top comment in blk-ioweight.c and documentation for
> more details.

Reading through the description here and in the comment, and with the
caveat that I am familiar with network packet scheduling but not with
the IO layer, I think your approach sounds quite reasonable; and I'm
happy to see improvements in this area!

One question: How are equal-weight cgroups scheduled relative to each
other? Or requests from different processes within a single cgroup for
that matter? FIFO? Round-robin? Something else?

Thanks,

-Toke
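
P.S. For illustration only, here is a minimal sketch of how a linear
cost model and a weight-proportional split of capacity, as described in
the quoted text, might look. The coefficients, the class and function
names, and the split-by-weight helper are all hypothetical and are not
taken from the patchset; the real controller also adapts to completion
latency, which this sketch ignores.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes per page; an assumption for this sketch


@dataclass(frozen=True)
class LinearCostModel:
    """Hypothetical linear model: fixed per-IO cost plus a per-page cost."""
    seq_base: float   # fixed cost of a sequential IO (made-up units)
    rand_base: float  # fixed cost of a random IO (made-up units)
    per_page: float   # extra cost per page transferred (made-up units)

    def cost(self, nbytes: int, sequential: bool) -> float:
        pages = -(-nbytes // PAGE_SIZE)  # ceiling division
        base = self.seq_base if sequential else self.rand_base
        return base + self.per_page * pages


def share_by_weight(weights: dict, capacity: float) -> dict:
    """Split a cost budget across cgroups in proportion to their weights."""
    total = sum(weights.values())
    return {name: capacity * w / total for name, w in weights.items()}


# Made-up coefficients, chosen only to show the shape of the model.
model = LinearCostModel(seq_base=1.0, rand_base=8.0, per_page=0.5)
print(model.cost(4096, sequential=False))    # 8.5   (one random 4 KiB IO)
print(model.cost(1 << 20, sequential=True))  # 129.0 (one sequential 1 MiB IO)

# Equal-weight cgroups "a" and "b" get equal budgets; "c" gets double.
print(share_by_weight({"a": 100, "b": 100, "c": 200}, capacity=1000.0))
```

Note that this only shows how much budget each cgroup would receive. It
says nothing about the order in which equal-weight cgroups, or
processes within one cgroup, are served, which is the question above.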